Preface



Ten years ago, the inaugural European Conference on Computer Vision was 
held in Antibes, France. Since then, ECCV has been held biennially under the 
auspices of the European Vision Society at venues around Europe. This year, 
the privilege of organizing ECCV 2000 falls to Ireland and it is a signal honour 
for us to host what has become one of the most important events in the calendar 
of the Computer Vision community. 

ECCV is a single-track conference comprising the highest quality, previously 
unpublished, contributed papers on new and original research in computer vision. 
This year, 266 papers were submitted and, following a rigorous double-blind
review process, with each paper being reviewed by three referees, 116 papers 
were selected by the Programme Committee for presentation at the conference. 

The venue for ECCV 2000 is the University of Dublin, Trinity College.
Founded in 1592, it is Ireland’s oldest university and has a proud tradition of
scholarship in the Arts, Humanities, and Sciences, alike. The Trinity campus,
set in the heart of Dublin, is an oasis of tranquility and its beautiful squares,
elegant buildings, and tree-lined playing-fields provide the perfect setting for any
conference.

The organization of ECCV 2000 would not have been possible without the 
support of many people. In particular, I wish to thank the Department of Com- 
puter Science, Trinity College, and its Head, Professor J. G. Byrne, for hosting 
the Conference Secretariat. Gerry Lacey, Damian Gordon, Niall Winters, and 
Mary Murray provided unstinting help and assistance whenever it was needed. 
Sarah Campbell and Tony Dempsey in Trinity’s Accommodation Office were a 
continuous source of guidance and advice. I am also indebted to Michael Nowlan 
and his staff in Trinity’s Information Systems Services for hosting the ECCV 
2000 web-site. I am grateful too to the staff of Springer-Verlag for always being
available to assist with the production of these Proceedings. There are many
others whose help, and forbearance, I would like to acknowledge: my thanks
to all.

Support came in other forms too, and it is a pleasure to record here the kind 
generosity of The University of Freiburg, MV Technology Ltd., and Captec Ltd., 
who sponsored prizes for Best Paper awards. 

Finally, a word about conferences. The technical excellence of the scientific 
programme is undoubtedly the most important facet of ECCV. But there are 
other facets to an enjoyable and productive conference, facets which should en- 
gender conviviality, discourse, and interaction; my one wish is that all delegates 
will leave Ireland with great memories, many new friends, and inspirational ideas 
for future research. 



Dublin, April 2000 



David Vernon 
misard@pa.dec.com



1 Introduction 

Partitioned sampling is a technique which was introduced in [17] for avoiding
the high cost of particle filters when tracking more than one object. In fact 
this technique can reduce the curse of dimensionality in other situations too. 
This paper describes how to use partitioned sampling on articulated objects, 
obtaining results that would be impossible with standard sampling methods. 
Because partitioned sampling is the statistical analogue of a hierarchical search, 
it makes sense to use it on articulated objects, since links at the base of the 
object can be localised before moving on to search for subsequent links. 

A new concept relating to particle filters, termed the survival rate, is
introduced, which sheds light on the efficacy of partitioned sampling. The domain of
articulated objects also highlights two important features of partitioned sampling 
which are discussed here for the first time: firstly, that the number of particles 
allocated to each partition can be varied to obtain the maximum benefit from a 
fixed computational resource; and secondly, that the number of likelihood eval- 
uations (the most expensive operation in vision-based particle filters) required 
can be halved by taking advantage of the way the likelihood function factorises 
for an articulated object. 

Another important contribution of the paper is the presentation of a vision- 
based “interface-quality” hand tracker: a self-initialising, real-time, robust and 
accurate system of sufficient quality to be used for complex interactive tasks 
such as drawing packages. The tracker models the hand as an articulated object 
and partitioned sampling is the crucial component in achieving these favourable 
properties. The system tracks a user’s hand on an arbitrary background using 
a standard colour camera, in such a way that the hand can be employed as a 
4-dimensional mouse (planar translation and the orientations of the thumb and 
index finger). 

Hand gesture recognition is the subject of much research, for a wide variety
of applications and by a plethora of methods. Kohler and Schröter [13] give
a comprehensive survey. We are not aware of any hand tracking system which
combines the speed, robustness, accuracy and simple hardware requirements of
the system described here. Among the more successful systems which recover
continuous parameters (rather than recognising gestures from a discrete
“vocabulary”), some use a stereo rig (e.g. [5,20]), some are not real time (e.g. [3,9]),
while others do not appear to have sufficient accuracy for the applications envis- 
aged here (e.g. [1,2,8,11,12]). Of these, [1,11] are the closest to our system in 
terms of the method used. In both cases, the tracking is good enough to permit 
navigation through a virtual environment, but not for the fine adjustment of 
interactive visual tools (e.g. drawing at pixel accuracy). 

2 Partitioned sampling and the efficiency of particle 
filters 

Partitioned sampling is a way of applying particle filters (also known as the 
Condensation algorithm e.g. [11]) to tracking problems with high-dimensional 
configuration spaces, without incurring the large computational cost that would 
normally be expected in such problems. In this section we first review particle 
filters, then explain why the large computational cost arises, and finally describe 
the basic idea behind partitioned sampling. 



2.1 Particle filters 

Consider a tracking problem with configuration space $\mathcal{X} \subset \mathbb{R}^d$. Recall that
Condensation expresses its belief about the system at time $t$ by approximating the
posterior probability distribution $p(x|\mathcal{Z}^t)$, where $\mathcal{Z}^t$ is the history of
observations $Z^1, \ldots, Z^t$ made at each time step, and $x \in \mathcal{X}$. The distribution
$p(x|\mathcal{Z}^t)$ is approximated using a weighted particle set $(x_i, \pi_i)_{i=1}^n$, which can be
interpreted as a sum of $\delta$-functions centred on the $x_i$ with real, non-negative
weights $\pi_i$ (one requires that $\sum_{i=1}^n \pi_i = 1$). Each time step of the Condensation
algorithm is just an update according to Bayes’ formula, implemented using operations
on particle sets which can be shown to have the desired effects (as $n \to \infty$)
on the underlying probability distributions. One step of Condensation can be
conveniently represented on a diagram as follows:

$$(p(x|\mathcal{Z}^{t-1})) \longrightarrow \boxed{\sim} \longrightarrow \boxed{* \, h(x'|x)} \longrightarrow \boxed{\times f(Z^t|x')} \longrightarrow (p(x'|\mathcal{Z}^t)) \qquad (1)$$

where the $\sim$ symbol denotes resampling, $*$ denotes convolving with dynamics,
and $\times$ denotes multiplication by the observation density. Specifically, the
resampling operation $\sim$ maps $(x_i, \pi_i)_{i=1}^n$ to $(x'_i, 1/n)_{i=1}^n$, where each $x'_i$ is
selected independently from the $\{x_1, \ldots, x_n\}$ with probability proportional to
$\pi_i$. This operation has no effect on the distribution represented by the particle
set, but often helps to improve the efficiency with which it is represented.
The dynamical convolution operation $*$ maps $(x_i, \pi_i)_{i=1}^n$ to $(x'_i, \pi_i)_{i=1}^n$, where $x'_i$
is a random draw from the conditional distribution $h(x'|x_i)$. Its effect on the
distribution represented by the particle set is to transform a distribution $p(x)$
into $\int h(x'|x) p(x) \, dx$. Finally, the multiplication operation $\times$ maps $(x_i, \pi_i)_{i=1}^n$
to $(x_i, \pi'_i)_{i=1}^n$, where $\pi'_i \propto \pi_i f(Z^t|x_i)$. Its probabilistic effect is to transform a
distribution $p(x)$ into the distribution proportional to $p(x) f(Z^t|x)$. Hence, the
overall effect of diagram (1) on the distribution $p(x|\mathcal{Z}^{t-1})$ is to transform it into
the distribution proportional to $f(Z^t|x') \int h(x'|x) p(x|\mathcal{Z}^{t-1}) \, dx$, which is precisely the
Bayes update rule for dynamical diffusion governed by $h(x'|x)$ and likelihood
function $f(Z^t|x')$.
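The three operations of diagram (1) can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function names, the one-dimensional state, the random-walk dynamics and the Gaussian likelihood in `demo` are all assumptions chosen only to make the example runnable.

```python
import math
import random


def condensation_step(particles, weights, sample_dynamics, likelihood, rng):
    """Apply resampling (~), dynamics (*) and likelihood weighting (x) once."""
    n = len(particles)
    # ~ : select n particles with probability proportional to their weights.
    resampled = rng.choices(particles, weights=weights, k=n)
    # * : replace each particle by a random draw from h(x'|x).
    predicted = [sample_dynamics(x, rng) for x in resampled]
    # x : resampled weights are all 1/n, so the new weights are
    #     proportional to the likelihood f(Z|x') alone.
    raw = [likelihood(x) for x in predicted]
    total = sum(raw)
    return predicted, [w / total for w in raw]


def demo(n=2000, seed=0):
    """Hypothetical 1-D example: prior centred on 0, measurement near 2."""
    rng = random.Random(seed)
    particles = [rng.gauss(0.0, 1.0) for _ in range(n)]
    weights = [1.0 / n] * n

    def sample_dynamics(x, r):
        return x + r.gauss(0.0, 0.5)  # assumed random-walk dynamics

    def likelihood(x):
        return math.exp(-0.5 * ((x - 2.0) / 0.5) ** 2)  # assumed Gaussian f

    particles, weights = condensation_step(
        particles, weights, sample_dynamics, likelihood, rng)
    posterior_mean = sum(w * x for x, w in zip(particles, weights))
    return particles, weights, posterior_mean
```

Calling `demo()` returns the updated particle set together with its weighted mean, which moves from the prior mean towards the assumed measurement.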

2.2 The survival diagnostic and survival rate 

In assessing the efficacy of particle filters we have found two quantities to be of
use: the survival diagnostic $\mathcal{D}$ and the survival rate $\alpha$. The survival diagnostic^1
is defined for a particle set $(x_i, \pi_i)_{i=1}^n$ as

$$\mathcal{D} = \left( \sum_{i=1}^n \pi_i^2 \right)^{-1}. \qquad (2)$$

Intuitively, it can be thought of as indicating the number of particles which would
survive a resampling operation. Two extreme cases make this clear. If $\pi_1 = 1$ and
all the other weights are zero, then $\mathcal{D} = 1$: only one particle will survive the
resampling. On the other extreme, if every weight is equal to $1/n$, then $\mathcal{D} = n$.
In this case, every particle would be chosen exactly once by an ideal resampling
operation, so all $n$ particles would survive.^2 Any particle set lies somewhere
between these two extremes. The survival diagnostic indicates whether tracking
performance is reliable or not: a low value of $\mathcal{D}$ indicates that estimates (e.g.
of the mean) based on the particle set may be unreliable, and that there is
significant danger of the tracker losing lock on its target. The difficult problem of
assessing the performance of particle filters is discussed in the statistical
literature (e.g. [4, 6, 7, 14]) and no single approach has met with resounding success.
In our experience, the survival diagnostic is as useful as any other indicator and
has the significant advantage of having negligible computational cost.
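The negligible cost is easy to see in code. The following sketch computes the survival diagnostic of equation (2); normalising the weights inside the function is an added convenience, and the function name is illustrative.

```python
def survival_diagnostic(weights):
    """Return D = 1 / sum(pi_i^2), after normalising the weights to sum to 1."""
    total = sum(weights)
    return 1.0 / sum((w / total) ** 2 for w in weights)
```

The two extreme cases described above correspond to `survival_diagnostic([1.0, 0.0, 0.0, 0.0])` and `survival_diagnostic([0.25] * 4)`.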

Whereas the survival diagnostic is a property of a given particle set, the
survival rate is a property of a given prior $p(x)$ and posterior $p'(x)$. Specifically,
the survival rate is given by

$$\alpha = \left( \int p'(x)^2 / p(x) \, dx \right)^{-1}. \qquad (3)$$

(See theorem 2 of [7] for another use of this quantity.) Again, a special case
is instructive. Suppose $p$ is a uniform distribution on a set $\mathcal{X}_p \subset \mathcal{X}$ of volume
$V_p$, and that $p'$ is also uniform, on a smaller subset $\mathcal{X}_{p'} \subset \mathcal{X}_p$ of volume $V_{p'}$.
Then $p'/p$ is equal to $V_p/V_{p'}$ everywhere on $\mathcal{X}_{p'}$, so that $\alpha = V_{p'}/V_p$. That is,
the survival rate is just the ratio of the volume of the posterior to the volume
of the prior. It turns out that this interpretation is valid in more general cases
too. Let $l(x) = p'(x)/p(x)$ be the likelihood function and define a particle set
$(x_i, \pi_i)$ which represents $p'$ by letting the $x_i$ be i.i.d. draws from $p(x)$ and setting
$\pi_i = l(x_i)$. Then it can be shown (see appendix) that for large $n$,

$$\mathcal{D} \approx \alpha n. \qquad (4)$$

This explains our terminology: $\alpha$ is called the survival rate because when
multiplied by $n$ it is approximately the number of particles expected to survive a
resampling. Hence we expect the overall tracking performance to be related to
the survival rate at each time step: if the survival rate is too low, the tracker
will be in danger of producing inaccurate estimates or losing lock altogether.

^1 Doucet [6] calls it the estimated effective sample size. See also [4].

^2 In fact, if truly random resampling is employed, a certain fraction of the particles
would not survive even in this case. But in practice one uses a deterministic version
of the resampling operation which selects every particle the appropriate number of
times.
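Relation (4) can be checked by simulation. The sketch below assumes a uniform prior on [0, 1) and takes an arbitrary likelihood function; it draws particles from the prior, weights them by the likelihood, and returns the ratio of the survival diagnostic to n, which should approach the survival rate. The function name and the two test likelihoods are illustrative choices, not taken from the paper.

```python
import random


def estimate_survival_rate(likelihood, n, seed=0):
    """Estimate alpha as D / n for a uniform prior on [0, 1)."""
    rng = random.Random(seed)
    xs = [rng.random() for _ in range(n)]      # i.i.d. draws from the prior
    pis = [likelihood(x) for x in xs]          # pi_i = l(x_i)
    total = sum(pis)
    d = 1.0 / sum((w / total) ** 2 for w in pis)  # survival diagnostic (2)
    return d / n


def box_likelihood(x):
    """Posterior uniform on [0, 0.2): volume ratio, hence alpha, is 0.2."""
    return 5.0 if x < 0.2 else 0.0


def ramp_likelihood(x):
    """Posterior density 2x: alpha = 1 / integral of 4x^2 over [0, 1] = 0.75."""
    return 2.0 * x
```

For the box likelihood the estimate is simply the fraction of particles that land inside the posterior's support, which matches the volume-ratio interpretation given above.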

Example. Figure 1 shows an example of a survival rate $\alpha$ calculated for a contour
likelihood in a real image. In this particular example, in which the configuration
space is the one-dimensional interval [-150, 150], the value of $\alpha$ was calculated
numerically as 0.20. (In more realistic multi-dimensional examples, typical values
of $\alpha$ are much lower than this.) Equation (4) can also be verified directly by
simulations for this simple example.

[Figure 1: plot of the log-likelihood ratio against x offset (pixels).]

Fig. 1. Survival rate. A contour likelihood of the kind used in section 3.1 is graphed
for a range of offsets in the x-direction from a template. Taking a uniform prior $p$ on
the interval $I = [-150, 150]$, the survival rate $\alpha$ for this particular likelihood function
can be calculated numerically as 0.20. This corresponds to the “volume” $\alpha I$ indicated
on the graph.
2.3 More dimensions means more particles

The survival rate concept makes it easy to see why particle filters require so many
extra particles to achieve the same level of performance as the dimension of the
configuration space increases. An informal argument runs as follows. Fix (by trial
and error, if necessary) a value $\mathcal{D}_{\min}$ which represents the minimum acceptable
survival diagnostic for successful tracking of a given object with a given
steady-state prior $p(x)$ on a configuration space $\mathcal{X}$. Then according to (4), we should
take $n \geq \mathcal{D}_{\min}/\alpha$ to achieve $\mathcal{D} \geq \mathcal{D}_{\min}$, where $\alpha$ is the survival rate for this
particular problem. Now consider tracking two such objects. By the definition
of the survival rate (3), it is easy to see the survival rate for the two-object
problem is $\alpha^2$, so that to achieve the same level of tracking performance (i.e. the
same minimum survival diagnostic) we must take $n \geq \mathcal{D}_{\min}/\alpha^2$. Since typically
$\alpha \ll 1$, this is a substantial additional requirement. Note this does not contradict
the well-known result that the variance of standard Monte Carlo estimators is
independent of the dimension of the configuration space. The general recipe of
“sample from a prior, then weight by a likelihood” can be regarded as a type
of importance sampling, and it is well-known that importance sampling scales
badly with dimension. [18] gives a lucid explanation of these phenomena.
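To make the scaling concrete, here is a worked example using the value $\alpha = 0.20$ from the example of Figure 1 and a purely hypothetical threshold $\mathcal{D}_{\min} = 50$ (the paper does not state a value). The last line assumes $k$ independent objects with the same survival rate and follows from the same argument as the two-object case.

```latex
\begin{align*}
\text{one object:}  \quad & n \geq \mathcal{D}_{\min}/\alpha   = 50/0.20 = 250,  \\
\text{two objects:} \quad & n \geq \mathcal{D}_{\min}/\alpha^2 = 50/0.04 = 1250, \\
\text{$k$ objects:} \quad & n \geq \mathcal{D}_{\min}/\alpha^k .
\end{align*}
```

With the much smaller survival rates typical of multi-dimensional problems, the gap between the one-object and two-object requirements widens accordingly.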

Partitioned sampling essentially eliminates the need for these additional
particles. The intuition that $\alpha$ is the ratio of the posterior and prior volumes gives
a hint as to how this problem could be solved. Take the simple case of tracking
2 objects A and B, whose configurations are described respectively by the
one-dimensional variables $x_A, x_B \in [0, 1]$. Suppose the survival rate for the
one-object problem is $\alpha$: then as remarked above, we have a survival rate $\alpha' = \alpha^2$
for the two-object problem. Figure 2 shows a schematic representation of the
situation. The intuition behind partitioned sampling is that instead of searching
the entire unit square for the lightly shaded area $\alpha'$, we can divide the search
into two stages: first, a search of the horizontal axis only, which will attempt to
populate the dark shaded area $\alpha$. This step will have survival rate $\alpha$. Second, we
try to populate the lightly shaded area. This second step will also have survival
rate of approximately $\alpha$, since the area of the lightly shaded region relative to
the dark shaded region is $\alpha'/\alpha = \alpha$. This is the key idea behind partitioned sampling. It remains to show
how we can “populate” certain parts of the configuration space with particles
in the desired manner. This is done using an operation on particle sets called
weighted resampling.



2.4 Weighted resampling 

Let $g(x)$ be a strictly positive, continuous function on $\mathcal{X}$ called the weighting
function. The weighting function is analogous to the importance
function used in standard importance sampling [19]. Weighted resampling with
respect to $g$ is an operation on a particle set which “populates” the peaks of
$g$ with particles, without altering the distribution actually represented by the
particle set. Given a particle set $(x_i, \pi_i)_{i=1}^n$, weighted resampling produces a
new set $(x'_i, \pi'_i)_{i=1}^n$ as follows. First define some “importance” weights
$\rho_i = g(x_i) / \sum_{j=1}^n g(x_j)$. Next, select indices $k_1, k_2, \ldots, k_n$ by setting $k_i = j$ with
probability $\rho_j$, independently for $i = 1, \ldots, n$. Finally, set $x'_i = x_{k_i}$ and $\pi'_i = \pi_{k_i}/\rho_{k_i}$.
This last choice of weights has the effect of precisely counteracting the extent to
which the particles were “biased” by the importance weights. A proof that the
weighted resampling operation does not alter the underlying distribution can be
found in [15]. On a Condensation diagram, the operation of weighted resampling
with respect to $g$ is denoted $\sim g$.

Fig. 2. Intuition behind partitioned sampling. To locate the peak of a 2D likelihood
function, which has area $\alpha' = \alpha^2$, the search is split into two stages, each of which
has survival rate $\alpha$. The first stage populates the dark shaded area with particles, and
the second stage populates the light shaded area.
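The three steps of weighted resampling translate directly into code. The sketch below is illustrative only: the function names are hypothetical, the final renormalisation of the weights is an added convenience, and the weighting function $g(x) = 0.2 + x$ used in `demo` is an arbitrary choice that favours large $x$.

```python
import random


def weighted_resample(particles, weights, g, rng):
    """Weighted resampling with respect to a strictly positive function g."""
    n = len(particles)
    g_values = [g(x) for x in particles]
    g_total = sum(g_values)
    rho = [v / g_total for v in g_values]              # importance weights
    indices = rng.choices(range(n), weights=rho, k=n)  # k_i = j with prob. rho_j
    new_particles = [particles[k] for k in indices]
    # Dividing by rho counteracts the bias introduced by the selection step.
    new_weights = [weights[k] / rho[k] for k in indices]
    total = sum(new_weights)                           # renormalise (added step)
    return new_particles, [w / total for w in new_weights]


def demo(n=5000, seed=1):
    """Particles uniform on [0, 1); g pulls them towards large x."""
    rng = random.Random(seed)
    particles = [rng.random() for _ in range(n)]
    weights = [1.0 / n] * n
    new_particles, new_weights = weighted_resample(
        particles, weights, lambda x: 0.2 + x, rng)
    mean_before = sum(w * x for x, w in zip(particles, weights))
    mean_after = sum(w * x for x, w in zip(new_particles, new_weights))
    unweighted_after = sum(new_particles) / n
    return mean_before, mean_after, unweighted_after
```

In `demo`, the unweighted average of the new particles shifts towards the peak of `g`, while the weighted mean stays close to its original value, which is the sense in which the represented distribution is unaltered.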

2.5 Partitioned sampling 

Partitioned sampling is a generic term for the strategy which consists of
dividing the state space into two or more “partitions”, and sequentially applying
the dynamics for each partition followed by an appropriate weighted resampling
operation. For example, the two-object problem described above could be
implemented as the following Condensation diagram:

$$(p(x|\mathcal{Z}^{t-1})) \longrightarrow \boxed{\sim} \longrightarrow \boxed{* \, h_A(x'|x)} \longrightarrow \boxed{\sim g} \longrightarrow \boxed{* \, h_B(x''|x')} \longrightarrow \boxed{\times f(Z^t|x'')} \longrightarrow (p(x''|\mathcal{Z}^t)) \qquad (5)$$

where we have assumed the dynamics can be decomposed as

$$h(x''|x) = \int_{x'} h_B(x''|x') \, h_A(x'|x) \, dx'. \qquad (6)$$

The algorithm is formally valid for any choice of $h_A, h_B$ satisfying (6), and for
any $g$; the objective of partitioned sampling is to use one’s intuition about the
problem to choose a decomposition of the dynamics, and a weighting function $g$,
which are beneficial. In the example shown in figure 2, the overall strategy is to
populate the dark shaded region first, so $h_A$ should be such that some particles
are diffused into the dark region, $g$ should be peaked in the dark region, and $h_B$
should be such that particles already in the dark region are not diffused out of
it. A natural choice, therefore, is to take $h_A$ to be the dynamics for object A, $g$
to be a likelihood function for the location of object A only, and $h_B$ to be the
dynamics for object B. This was the approach taken by the authors of [17].
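A two-partition step following this natural choice can be sketched as below. This is a hedged illustration, not the implementation of [17]: the state is a pair of scalars, the Gaussian dynamics and likelihoods, the target positions, and all function names are assumptions, and a tiny constant is added to `g` only to keep it strictly positive in floating point.

```python
import math
import random


def partitioned_step(particles, weights, h_a, g, h_b, f, rng):
    """Resample, apply h_A, weighted-resample by g, apply h_B, weight by f."""
    n = len(particles)
    xs = rng.choices(particles, weights=weights, k=n)       # ~
    xs = [h_a(x, rng) for x in xs]                           # * h_A
    g_values = [g(x) for x in xs]                            # ~ g
    g_total = sum(g_values)
    rho = [v / g_total for v in g_values]
    indices = rng.choices(range(n), weights=rho, k=n)
    xs = [xs[k] for k in indices]
    ws = [(1.0 / n) / rho[k] for k in indices]
    xs = [h_b(x, rng) for x in xs]                           # * h_B
    ws = [w * f(x) for w, x in zip(ws, xs)]                  # x f
    total = sum(ws)
    return xs, [w / total for w in ws]


def gaussian(u, centre, sigma=0.05):
    return math.exp(-0.5 * ((u - centre) / sigma) ** 2)


def demo(n=1000, seed=2):
    """Two 1-D objects A and B at assumed true positions 0.7 and 0.3."""
    rng = random.Random(seed)
    target_a, target_b = 0.7, 0.3
    particles = [(rng.random(), rng.random()) for _ in range(n)]
    weights = [1.0 / n] * n
    h_a = lambda x, r: (x[0] + r.gauss(0.0, 0.05), x[1])  # moves A only
    h_b = lambda x, r: (x[0], x[1] + r.gauss(0.0, 0.05))  # moves B only
    g = lambda x: gaussian(x[0], target_a) + 1e-12         # likelihood of A
    f = lambda x: gaussian(x[0], target_a) * gaussian(x[1], target_b)
    xs, ws = partitioned_step(particles, weights, h_a, g, h_b, f, rng)
    mean_a = sum(w * x[0] for x, w in zip(xs, ws))
    mean_b = sum(w * x[1] for x, w in zip(xs, ws))
    return mean_a, mean_b
```

Because `h_b` leaves the first coordinate untouched, particles that the weighted resampling placed near object A stay there while the second coordinate is searched.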



3 Partitioned sampling for articulated objects 

Although the preceding discussion was phrased for clarity in terms of multiple 
objects, partitioned sampling is not restricted to improving the efficiency of 
multiple object tracking. In fact, it can be used whenever the following conditions 
hold. 

• The configuration space $\mathcal{X}$ can be partitioned as a Cartesian product
$\mathcal{X} = \mathcal{X}_1 \times \mathcal{X}_2$.

• The dynamics $h$ can be decomposed as $h = h_1 * h_2$, where $h_2$ acts on $\mathcal{X}_2$.
This means that if $x = (x_1, x_2)$ and $x' = (x'_1, x'_2)$ with $x_1, x'_1 \in \mathcal{X}_1$, and $x'$ is
a random draw from $h_2(\cdot|x)$, then $x'_1 = x_1$. Informally, the second partition
of the dynamics does not change the value of the projection of any particle
into the first partition of the configuration space.^3 We refer to this later as
property (*).

• A weighting function $g_1$ defined on $\mathcal{X}_1$ is available, which is peaked in the
same region as the posterior restricted to $\mathcal{X}_1$.

There is also an obvious generalisation to $k > 2$ partitions: the configuration
space is partitioned as $\mathcal{X} = \mathcal{X}_1 \times \ldots \times \mathcal{X}_k$, the dynamics as $h = h_1 * \ldots * h_k$ with
each $h_j$ acting on $\mathcal{X}_j \times \ldots \times \mathcal{X}_k$, and we have weighting functions $g_1, g_2, \ldots, g_{k-1}$
with each $g_j$ peaked in the same region as the posterior restricted to $\mathcal{X}_j$.

^3 This condition is stronger than necessary, but a more general discussion would
obscure the important idea.

J. MacCormick and M. Isard

One example of such a system is an articulated object. The example given in
this paper is of a hand tracker which models the fist, index finger and thumb as
an articulated rigid object with three joints. The partitioned sampling algorithm
used for this application is shown in the following Condensation diagram:




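$$(p(x|\mathcal{Z}^{t-1})) \to \boxed{\sim} \to \boxed{* \, h_f(x'|x)} \to \boxed{\sim f_f} \to \boxed{* \, h_{th1}(x''|x')} \to \boxed{\sim f_{th1}} \to \boxed{* \, h_{th2}(x'''|x'')} \to \boxed{\sim f_{th2}} \to \boxed{* \, h_i(x''''|x''')} \to \boxed{\times f(Z^t|x'''')} \to (p(x''''|\mathcal{Z}^t)) \qquad (7)$$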



The subscript ‘f’ stands for “fist”, ‘th1’ for “first thumb joint”, ‘th2’ for “second
thumb joint”, and ‘i’ for “index finger”. So the configuration space is partitioned
into 4 parts:

• $\mathcal{X}_f$ = scale, orientation, and x and y translation of the fist

• $\mathcal{X}_{th1}$ = joint angle of base of thumb

• $\mathcal{X}_{th2}$ = joint angle of tip of thumb

• $\mathcal{X}_i$ = joint angle of index finger

The dynamics are decomposed as $h = h_f * h_{th1} * h_{th2} * h_i$ with the last three
operations consisting of a deterministic shift plus Gaussian diffusion within the
appropriate partition only. Note that although $\mathcal{X}$ is a shape space of splines,
it is not described by the linear parameterisation normally used for such shape
spaces. Instead it is parameterised by the 7 physical variables listed above (scale,
orientation, x and y translation, and the 3 joint angles), so that any $x$ is an
element of $\mathbb{R}^7$.

3.1 Likelihood function and weighting functions 

It remains to specify the measurement likelihood $f(Z|x)$. Recall that the
parameters $x$ correspond to a B-spline in the image. A one-dimensional grey-scale edge
operator is applied to the normal lines to this B-spline at 28 points (8 on the
main hand, 6 on each of the thumb joints and 8 on the index finger). Each of
the 28 resulting “edges” (actually points which are the nearest above-threshold
responses of a 1D operator) has a normal distance $\nu_i$ from the B-spline, which
would be zero if the model fitted the image edges perfectly. By assuming (i) the
deviations of the model from the template shape are Gaussian, (ii) that such
deviations are independent on different normal lines, and (iii) there is a fixed
probability of finding no edge, it is easy to see that the form of $f(Z|x)$ should
be

log/(Z|x) cx const -I- E""’. (8) 

m 

where the constant was set by hand for this application. We can also exploit the
fact that the portion of a normal line on the interior of the B-spline should be
skin-coloured. This is reflected by adding to (8) the output of correlating the
(colour) normal line pixel values with a colour template. Full details on densities
of the form (8) can be found in [15, 16], for example.

Recall there are 28 measurement lines on the hand template: 8 on the fist,
6 on each of the thumb joints and 8 on the index finger. Since the likelihood
factorises as a product of likelihoods for individual measurement lines, this gives
us a convenient way to re-express the likelihood:

$$f(Z|x) = f_f(Z_f|x_f) \, f_{th1}(Z_{th1}|x_f, x_{th1}) \, f_{th2}(Z_{th2}|x_f, x_{th1}, x_{th2}) \, f_i(Z_i|x_f, x_i) \qquad (9)$$

where, for example, $Z_f$ are the measurements on the 8 fist locations, $x_f$ are the
components of $x$ which specify the configuration of the fist, and similarly for the
other subscripts.

The factorisation (9) immediately suggests the use of $f_f$, $f_{th1}$ and $f_{th2}$ as
weighting functions, since they should be peaked at the correct locations of the
fist and thumb joints respectively. This is precisely what the implementation
does; hence the presence of $f_f$, $f_{th1}$ and $f_{th2}$ on diagram (7).



3.2 Dividing effort between the partitions 

An important advantage of partitioned sampling is that the number of particles
devoted to each partition can be varied. Partitions which require a large
number of particles for acceptable performance can be satisfied without
incurring additional effort in the other partitions. For instance, in the hand tracking
application, the fist often moves rapidly and unpredictably whereas the joint
angles of finger and thumb tend to change more slowly. Hence we use $n_1 = 700$
particles for the fist partition, but only $n_2 = n_3 = 100$ particles for the two
thumb partitions and $n_4 = 90$ for the index finger partition. A glance at
diagram (7) shows this produces a substantial saving, since at every time-step we
avoid calculating $f_{th1}(Z_{th1}|x)$, $f_{th2}(Z_{th2}|x)$ and $f_i(Z_i|x)$ for over 600 values of $x$
that would otherwise have been required.

Note that the analysis of section 2.3, in terms of survival rates, cannot nec- 
essarily be used to determine the optimum allocation of particles between the 
partitions. If the dynamics and observations in each partition are completely in- 
dependent, and inaccuracies in the estimated parameters for each partition are 
equally costly, then one can show that the number of particles in each partition 
should be inversely proportional to the survival rate for that partition. However, 
these conditions are never satisfied for an articulated object. Indeed, almost the 
opposite is true in the hand-tracking case. For one thing, since the intended ap- 
plication is a drawing tool based on the position of the finger tip, inaccuracies 
of many pixels are acceptable in the fist position, provided only that lock is not 
lost on the finger tip. However, even small errors in the finger tip position will 
degrade the performance of the drawing tool greatly. Thus one might think that 
the majority of particles should be devoted to the finger tip partition. 

Two factors militate against this conclusion, however. One is that the precise
location of the finger tip is in fact determined by an auxiliary least-squares fitting
operation mentioned later; hence the imperative for accuracy in this partition is
not so great. Second, it is of overwhelming importance that lock is not lost on
the fist, because the search lines for locating finger and thumb are placed relative
to the fist. Experiment showed that this factor is the most crucial, which is why
the majority of particles are devoted to the fist partition.

So far we have not been able to develop a coherent theory for choosing how to 
allocate the particles between partitions in such cases, and can only recommend 
careful experimentation. Some insight can be gained by studying simulated data, 
however. Figure 3 shows the results of tracking a simulated articulated chain 
with several links. The state space is divided into one partition for each link, 
and a fixed number of particles was divided between these in various ways. The 
graphs show the variance (in pixels$^2$) of the end-point of the articulated object,
as estimated by partitioned sampling averaged over 200 frames. Several different 
runs were made for each set of parameter values; the curves shown are the best-fit 
(least-squares) quartics through all data points. Figure 3(a) is for a 3-link object 
whose dynamics have equal variance at each link. A total of 300 particles were 
available; 100 were allocated to the final partition and the remaining particles 
divided between the first two partitions. Because the dynamics and likelihood 
function are the same for each partition, the survival rates are similar for each 
partition, and as we might expect, the minimum variance is achieved by equally 
dividing these particles between the first two partitions. 





[Figure 3: two panels. (a) equal noise on each link; (b) additional noise on first three (of six) links.]

Fig. 3. Allocating resources to different partitions. (a) Because the variance of
the dynamics for each link is equal, the survival rate for each partition is approximately
the same and the best allocation of particles is to distribute them evenly between
partitions. (b) Now the variance of the dynamics on early partitions is 9 times higher than
the later ones, so the survival rates on early partitions are lower and it is best to devote
a higher proportion of the particles to these partitions.



Figure 3(b) is a more extreme example. Now there are 6 links, and the first
three links have dynamics which are much “noisier” than the last three links.
Specifically, the first three links have the same dynamics $h_{123}(\cdot|x)$ and the last
three share a different conditional density $h_{456}(\cdot|x)$ for their dynamics. The
densities $h_{123}, h_{456}$ were Gaussian with $\mathrm{var}(h_{123}) = 9 \, \mathrm{var}(h_{456})$. Because of the higher
variance of the dynamics, the survival rate for particles in the first three
partitions is lower than those in the last three; hence we expect that it will be most
efficient to devote the majority of particles to the first three partitions. This is
indeed the case; from the graph it appears that the best results are achieved
when 60-70% of the particles are devoted to the first three partitions. Notice
the extremely high variances for many data points outside this range: these are
caused by the tracker losing lock on the early partitions.



3.3 Articulated objects can be evaluated twice as fast 



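In the particular case in which the overall likelihood $f(Z|x)$ can be expressed as a
product (9) of the weighting functions and another easily calculated function (in
this case, $f_i$), the diagram (7) can be given a simpler form which uses standard
resampling rather than weighted resampling:

$$\begin{aligned}
(p(x|\mathcal{Z}^{t-1})) &\to \boxed{\sim} \to \boxed{* \, h_f(x'|x)} \to \boxed{\times f_f} \to \boxed{\sim} \to \boxed{* \, h_{th1}(x''|x')} \to \boxed{\times f_{th1}} \to \boxed{\sim} \\
&\to \boxed{* \, h_{th2}(x'''|x'')} \to \boxed{\times f_{th2}} \to \boxed{\sim} \to \boxed{* \, h_i(x''''|x''')} \to \boxed{\times f_i} \to (p(x''''|\mathcal{Z}^t))
\end{aligned} \qquad (10)$$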



One can check the equivalence by just writing out in detail the algorithm
described by each diagram. The key is property (*) mentioned in section 3: e.g. the
“fist” component $x_f$ of a particle does not change after the fist partition, so the
value of $f_f$ for the particle does not change either. In other words, the evaluation
of any given importance function commutes with the dynamics from subsequent
partitions.

The reformulation of (7) as (10) is important because the computational ex- 
pense of the hand tracking largely resides in evaluating the likelihood functions. 
Using diagram (7), the likelihood of each measurement line (except those on the 
index finger) is evaluated twice — once as part of a weighting function, and once 
as part of the final likelihood function. In diagram (10), each measurement line 
is examined only once. 
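A rough count makes the saving explicit. The tally below considers a single particle that passes through every stage and uses the measurement-line counts given in section 3.1 (8 on the fist, 6 on each thumb joint, 8 on the index finger); it deliberately ignores the different numbers of particles used in each partition, so it is an illustration rather than an exact cost model.

```latex
\begin{align*}
\text{diagram (7):}  \quad & \underbrace{8 + 6 + 6}_{\text{weighting functions}}
                             + \underbrace{28}_{\text{final likelihood}} = 48
                             \text{ line evaluations}, \\
\text{diagram (10):} \quad & 8 + 6 + 6 + 8 = 28 \text{ line evaluations}.
\end{align*}
```

Under this simplification the reformulated diagram needs $28/48 \approx 58\%$ of the line evaluations, which is the sense in which the cost is roughly halved.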



3.4 Other details 

Initialisation and re-initialisation are handled by the ICondensation mechanism 
of [11]. Various standard tools, such as background subtraction (which can be 
performed on an SGI Octane very cheaply using the alpha-blending hardware), 
and least-squares fitting of an auxiliary spline to the tip of the index finger, are 
used to refine the performance of the tracker. Details of these tools can be found 
in our technical report [10]. 
4 Results: a vision-based drawing package

The hand tracker described in the previous section was implemented on an SGI
Octane with a single 175MHz R10000 CPU. Using 700 samples for the hand
base, 100 samples for each of the thumb joints and 90 samples for the index
finger, the tracker consumes approximately 75% of the machine cycles, which
allows real-time operation at 25Hz with no dropped video frames even while
other applications are running on the machine. The tracker is robust to clutter
(figure 4), including skin-coloured objects (figure 5). The position of the index
finger is located with considerable precision (figure 5) and the two articulations
in the thumb are also recovered with reasonable accuracy (figure 6).






Fig. 4. Heavy clutter does not hinder the hand tracker. Even moving the papers on the
desk to invalidate the background subtraction does not prevent the Condensation tracker
functioning. The fingertip localisation is less robust, however, and jitter increases in
heavily cluttered areas.



We have developed a simple drawing package to explore the utility of a vision- 
based hand tracker for user-interface tasks. The tracking achieved is sufficiently 
good that it can compete with a mouse for freehand drawing, though (currently) 
at the cost of absorbing most of the processing of a moderately powerful worksta- 
tion. It is therefore instructive to consider what additional strengths of the vision 
system we can exploit to provide functionality which could not be reproduced 
using a mouse. 

Fig. 5. Left: Skin-coloured objects do not distract the tracker. Here two hands are 
present in the image but tracking remains fixed to the right hand. If the right hand were 
to leave the field of view the tracker would immediately reinitialise on the left hand. 
Middle and right: The index finger is tracked rotating relative to the hand body. The 
angle of the finger is estimated with considerable precision, and agile motions of the 
fingertip, such as scribbling gestures, can be accurately recorded. 

The current prototype drawing package provides only one primitive, the freehand 
line. When the thumb is extended, the pointer draws, and when the thumb 
is placed against the hand the virtual pen is lifted from the page. Immediately 
we can exploit one of the extra degrees of freedom estimated by the tracker, 
and use the orientation of the index finger to control the width of the line being 
produced. When the finger points upwards on the image, the pen draws 
with a default width, and as the finger rotates the width varies from thinner 
(finger anti-clockwise) to thicker (finger clockwise); see figure 7. The scarcity 
of variable-thickness lines in computer-generated artwork is a testament to the 
difficulty of producing this effect with a mouse. 
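
The width control described above can be sketched in a few lines. The text only states that the width is a default value when the finger points up, thinner for anti-clockwise rotation and thicker for clockwise rotation; the exponential mapping, the parameter names and all numeric values below are illustrative assumptions, not the mapping used in the drawing package.

```python
import math

def line_width(finger_angle, default_width=3.0, doublings_per_quarter_turn=2.0,
               min_width=0.5, max_width=12.0):
    """Map index-finger orientation to a pen width (hypothetical mapping).

    finger_angle is in radians: 0 means the finger points straight up in the
    image, positive values are clockwise, negative values anti-clockwise.
    """
    # Assumed smooth, monotonic mapping: the width doubles a fixed number of
    # times per quarter turn, so the default width is returned at angle 0.
    scale = 2.0 ** (doublings_per_quarter_turn * finger_angle / (math.pi / 2))
    # Clamp so that extreme angles still give a usable stroke.
    return min(max_width, max(min_width, default_width * scale))

if __name__ == "__main__":
    for angle in (-math.pi / 4, 0.0, math.pi / 4):
        print(round(angle, 3), line_width(angle))
```

Any monotonic mapping would serve; an exponential one is chosen here only so that equal rotations give equal relative changes in width.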

The fact that a camera is observing the desk also allows other intriguing 
features not directly related to hand-tracking. We have implemented a natural 
interface to translate and rotate the virtual workspace for the modest hardware 
investment of a piece of black paper (figure 8). The very strong white-to-black 
edges from the desk to the paper allow the paper to be tracked with great pre- 
cision using a simple Kalman filter, at low computational cost. Translations and 
rotations of the paper are then reflected in the virtual workspace, a very sat- 
isfying interface paradigm. While one hand draws, the other hand can adjust 
the workspace to the most comfortable position. Figure 9 is a still from a movie 
which shows the system in action; this movie is available at [15]. In the future 
it should be possible to perform discrete operations such as switching between 
drawing tools using simple static gesture recognition on one of the hands. Track- 
ing both hands would allow more complex selection tasks, for example continuous 
zooming, or colour picking. 



5 Conclusion 

It has been shown that the technique of partitioned sampling can be applied 
to articulated objects. A new concept termed the “survival rate” of particles in 
a particle filter was used to explain why partitioned sampling works, and some 
special features of the application to articulated objects were exploited for signif- 
icant computational improvements. Although some progress has been made, the 
question of how to allocate a fixed number of particles between partitions has 
not been answered coherently and this must be the subject of future work. Another 
open problem, not previously mentioned, is that our current “articulated 
partitioned” approach takes no account of the tree structure of the object: every 
link must be sampled as a chain even though the physical structure is a tree. 
Our present approach is valid mathematically, but it would be more appropriate, 
and possibly more efficient, to take account of the tree structure. 

Fig. 6. The two degrees of freedom of the thumb are tracked. The thumb angles 
are not very reliably estimated. This is probably partly because the joints are short, and 
so offer few edges to detect, and more importantly because the shape model gives a poor 
approximation to the thumb when it opposes. The gross position of the thumb can be 
extracted consistently enough to provide a stable switch which can be used analogously 
to a mouse button. 

A hand-tracking system using partitioned sampling on articulated objects 
was described. It is of sufficient quality for very demanding interactive tasks. 
The main features of the system are robustness (from the Condensation algo- 
rithm), instantaneous initialisation and near-perfect responsiveness (from im- 
portance sampling based on colour segmentation) and inexpensive addition of 
extra degrees of freedom (from partitioned sampling). The system runs on a 
single-processor workstation with a standard colour camera and no additional 
hardware. Even in the simple drawing package described it is easily possible to 
produce figures which could not comfortably be produced with a mouse, and 
to do so using natural gestures and a natural, changing desk environment. We 
believe this system has significant implications for the everyday use of virtual 
environments with interactive computer vision. 



A Appendix 



Fig. 7. Line thickness is controlled using the orientation of the index finger. The 
top image shows a line drawn with the index finger pointing to the left, producing a 
thin trace. In the bottom image the finger pointed to the right and the line is fatter. Of 
course if the finger angle varies while the line is being drawn, a continuous variation 
of thickness is produced. 

An informal proof of (4) follows; more details can be found in [15]. Recall the 
scenario of section 2.2: a particle set $(x_i, \pi_i)_{i=1}^{n}$ has been formed with prior (or 
“proposal density”) $p(x)$ and weighted by likelihood $p'(x)/p(x)$, resulting in a 
posterior $p'(x)$. Some simple calculations give 

\begin{align*}
V &= \Big(\sum_{i=1}^{n} \pi_i^2\Big)^{-1} && \text{by definition of } V \text{, equation (2)} \\
  &\approx \Big(\sum_{i=1}^{n} \Big(\frac{p'(x_i)}{n\,p(x_i)}\Big)^2\Big)^{-1} && \text{by defn of the } \pi_i \text{, and comment below} \\
  &\approx \Big(n \int \Big(\frac{p'(x)}{n\,p(x)}\Big)^2 p(x)\,dx\Big)^{-1} && \text{as the } x_i \text{ are drawn from } p(x) \\
  &= \Big(\int \frac{p'(x)^2}{p(x)}\,dx\Big)^{-1} \times n \\
  &= \alpha n.
\end{align*}

The second line uses the fact (see [15]) that for large n, the normalisation constant 
for the weights is approximately 1/n, so $\pi_i \approx p'(x_i)/(n\,p(x_i))$. 
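
The approximation proved above can be checked numerically. The sketch below is not part of the original proof: it assumes one-dimensional zero-mean Gaussians for both the prior p and the posterior p' (with illustrative standard deviations 2 and 1), takes the diagnostic to be the inverse sum of squared normalised weights, and compares V/n from a simulated particle set with the survival rate obtained by direct integration. All function names are hypothetical.

```python
import numpy as np

def gaussian_pdf(x, std):
    """Zero-mean Gaussian density with standard deviation std."""
    return np.exp(-0.5 * (x / std) ** 2) / (std * np.sqrt(2.0 * np.pi))

def survival_diagnostic(weights):
    """V = (sum_i pi_i^2)^-1 for weights normalised to sum to one."""
    pi = np.asarray(weights, dtype=float)
    pi = pi / pi.sum()
    return 1.0 / np.sum(pi ** 2)

def survival_rate(prior_std, posterior_std, half_width=12.0, steps=200001):
    """alpha = (integral of p'(x)^2 / p(x) dx)^-1, using a plain Riemann sum."""
    x = np.linspace(-half_width, half_width, steps)
    integrand = gaussian_pdf(x, posterior_std) ** 2 / gaussian_pdf(x, prior_std)
    return 1.0 / (np.sum(integrand) * (x[1] - x[0]))

def simulate(n, prior_std=2.0, posterior_std=1.0, seed=0):
    """Draw n particles from the prior, weight by p'/p, and return V."""
    rng = np.random.default_rng(seed)
    x = rng.normal(0.0, prior_std, size=n)
    weights = gaussian_pdf(x, posterior_std) / gaussian_pdf(x, prior_std)
    return survival_diagnostic(weights)

if __name__ == "__main__":
    n = 200_000
    print("V / n =", simulate(n) / n)
    print("alpha =", survival_rate(2.0, 1.0))
```

For these two Gaussians the integral can also be done by hand, giving a survival rate of sqrt(7)/4, so the two printed numbers should both be close to 0.66.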
Fig. 8. Moving around the virtual workspace is accomplished by following the 
tracked outline of a physical object. The piece of black paper can be tracked with a 

Fig. 9. The drawing package in action. This screen shot shows a cartoon character 
drawn using the drawing package; note the variable-width lines. A movie clip showing 
this figure being created is available at [15]. The scene on the left is the camera’s view; 
it is shown for information only and is not employed by the user. 

Abstract. This paper describes a highly flexible approach to real-time 
frame-rate tracking in complex camera and structure configurations, 
including the use of multiple cameras and the tracking of multiple or 
articulated targets. A powerful and general method is presented for ex- 
pressing and solving the constraints which exist in these configurations 
in a principled manner. This method exploits the geometric structure 
present in the Lie group and Lie algebra formalism to express the con- 
straints that derive from structures such as hinges or a common ground 
plane. This method makes use of the adjoint representation to simplify 
the constraints which are then applied by means of Lagrange multipliers. 



1 Introduction 

The tracking of known three-dimensional objects is useful for numerous appli- 
cations, including motion analysis, surveillance and robotic control tasks. This 
paper presents a novel approach to visual tracking in complex camera and struc- 
ture configurations, including the use of multiple cameras and the tracking of 
multiple structures with constraints or of articulated structures. Earlier work in 
the tracking of rigid bodies which employs a Lie group and Lie algebra for- 
malism is exploited in order to simplify the difficulties that arise in these more 
complex situations and thus provide a real-time frame-rate tracking system. 

The paper first reviews work on the tracking of rigid bodies and then de- 
scribes the Lie group and Lie algebra formalism used within the rigid body 
tracking system which is used as the basis for more complex configurations. It 
then shows how this formalism provides a powerful means of managing complex 
multi-component configurations; the transformation of measurements made in 
differing co-ordinate frames is simplified as is the expression of constraints (e.g. 
hinge or slide) that are present in the system. These constraints can then be im- 
posed by means of Lagrange multipliers. Results from experiments with real-time 
frame-rate systems using this framework are then presented and discussed. 



1.1 Model-Based Tracking 

Because a video feed contains a very large amount of data, it is important to 
extract only a small amount of salient information if real-time frame (or field) 
rate performance is to be achieved [2]. This observation leads to the notion of 
feature based tracking [3] in which processing is restricted to locating strong 
image features such as contours [?]. 

A number of successful systems have been based on tracking the image contours 
of a known model. Lowe [?] used the Marr-Hildreth edge detector to extract 
edges from the image which were then chained together to form lines. These 
lines were matched and fitted to those in the model. A similar approach using the 
Hough transform has also been used [3]. The use of two-dimensional image 
processing incurs a significant computational cost and both of these systems make 
use of special purpose hardware in order to achieve frame rate processing. 



An alternative approach is to render the model first and then use sparse 
one-dimensional search to find and measure the distance to matching (nearby) 
edges in the image. This approach has been used in RAPID [?], Condensation 
[?] and other systems [?]. The efficiency yielded by this approach allows 
all these systems to run in real-time on standard workstations. The approach is 
also used here. 



Using either of these approaches, most systems (except Condensation) then 
compute the pose parameters by linearising with respect to image motion. This 
process is reformulated here in terms of the Lie group SE(3) and its Lie 
algebra (see [13,14] for a good introduction to Lie groups and their algebras). This 
formulation is a natural one to use since SE(3) exactly represents the space of 
poses that form the output of a system which tracks a rigid body. Differential 
quantities such as velocities and small motions in the group then correspond to 
the Lie algebra of the group (which is the tangent space to the identity). Thus 
the representation provides a canonical method for linearising the relationship 
between image motion and pose parameters. Further, this approach can be 
generalised to other transformation groups and has been successfully applied to 
deformations of a planar contour using the groups GA(2) and P(2) [15]. 

Outliers are a key problem that must be addressed by systems which measure 
and fit edges. They frequently occur in the measurement process since additional 
edges may be present in the scene in close proximity to the model edges. These 
may be caused by shadows, for example, or strong background scene elements. 
Such outliers are a particular problem for the traditional least-squares fitting 
method used by many of the algorithms. Methods of improving robustness to 
these sorts of outliers include the use of RANSAC [?], factored sampling [?] 
or regularisation, for example the Levenberg-Marquardt scheme used in [?]. The 
approach used here employs iterative re-weighted least squares (a robust 
M-estimator). 

There is a trade-off to be made between robustness and precision. The Con- 
densation system, for example, obtains a high degree of robustness by taking a 
large number of sample hypotheses of the position of the tracked structure with 
a comparatively small number of edge measurements per sample. By contrast, 
the system presented here uses a large number of measurements for a single po- 
sition hypothesis and is thus able to obtain very high precision in its positional 
estimates. This is particularly relevant in tasks such as visual servoing since the 
dynamics and environmental conditions can be controlled so as to constrain the 
robustness problems, while high precision is needed in real-time in order for the 
system to be useful. 

Occlusion is also a significant cause of instabilities and may occur when the 
object occludes parts of itself (self occlusion) or where another object lies between 
the camera and the target (external occlusion). RAPID handles the first of these 
problems by use of a pre-computed table of visible features indexed by what is 
essentially a view-sphere. By contrast, the system presented here uses graphical 
rendering techniques [?] to dynamically determine the visible features and is 
thus able to handle more complex situations (such as objects with holes) than 
can be tabulated on a view-sphere. 

External occlusion can be treated by using outlier rejection, for example 
in [?] which discards primitives for which insufficient support is found, or by 
modifying statistical descriptions of the observation model (as in [?]). If a model 
is available for the intervening object, then it is possible to use this to re-estimate 
the visible features [?]. Both of these methods are used within the system 
presented here. 



1.2 Articulated Structures 

A taxonomy of non-rigid motion is given in [?]. This paper is only concerned 
with what is classified as articulated motion, which can be characterised as 
comprising rigid components connected by simple structures such as hinges, slides 
etc. 

Lowe [?] also considered articulated motion, which was implemented by 
means of internal model parameters which are stored in a tree structure. By 
contrast, the approach presented here uses a symmetric representation in which 
the full pose of each rigid component is stored independently. Constraints are 
then imposed on the relationships between component pose estimates. A similar 
approach has been taken for tracking people [?] which relies on prior extraction 
of accurate silhouettes in multiple synchronised views of each frame which are 
then used to apply forces on the components of the three dimensional model. 



2 Tracking a Rigid Structure in a Single View 

This section will review the rigid body tracking system which is used as a basis 
for the extensions which are presented in this paper. The approach used here for 
tracking a known 3-dimensional structure is based upon maintaining an estimate 
of the camera projection matrix, P, in the co-ordinate system of the structure. 
This projection matrix is represented as the product of a matrix of internal 
camera parameters: 



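\[ K = \begin{bmatrix} f_u & s & u_0 \\ 0 & f_v & v_0 \\ 0 & 0 & 1 \end{bmatrix} \tag{1} \]

and a Euclidean projection matrix representing the position and orientation of 
the camera relative to the target structure: 

\[ E = [\, R \mid t \,] \quad \text{with } R R^T = I \text{ and } |R| = 1 \tag{2} \]

The projective co-ordinates of an image feature are then given by 

\[ \begin{pmatrix} u \\ v \\ w \end{pmatrix} = P \begin{pmatrix} x \\ y \\ z \\ 1 \end{pmatrix} \tag{3} \]

with the actual image co-ordinates given by 

\[ \begin{pmatrix} \tilde{u} \\ \tilde{v} \end{pmatrix} = \begin{pmatrix} u/w \\ v/w \end{pmatrix} \tag{4} \]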



Rigid motions of the camera relative to the target structure between conse- 
cutive video frames can then be represented by right multiplication of the pro- 
jection matrix by a Euclidean transformation of the form: 



\[ M = \begin{bmatrix} R & t \\ 0\ 0\ 0 & 1 \end{bmatrix} \tag{5} \]

These M form a 4 × 4 matrix representation of the group SE(3) of rigid 
body motions in 3-dimensional space, which is a 6-dimensional Lie Group. The 
generators of this group are typically taken to be translations in the x, y and z 
directions and rotations about the x, y and z axes, represented by the following 
matrices: 
These generators form a basis for the vector space (the Lie algebra) of 
derivatives of SE(3) at the identity. Group elements can be obtained from the 
generators via the exponential map: 

\[ M = \exp(\alpha_i G_i) \tag{7} \]

Thus, if M represents the transformation of the structure between two adjacent 
video frames, then the task of the tracking system becomes that of finding the 
$\alpha_i$ that describe the inter-frame transformation. Since the motion will be small, 
M can be approximated by the linear terms: 

\[ M \approx I + \alpha_i G_i \tag{8} \]
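
To make the exponential map and its linearisation concrete, the sketch below builds six generators of SE(3) as 4 × 4 matrices, exponentiates a small coefficient vector as in (7), and compares the result with the linear approximation (8). The generator ordering (translations in x, y, z followed by rotations about x, y, z) follows the text, but the sign convention chosen for the rotation generators is an assumption (right-handed rotations), and the function names are hypothetical.

```python
import numpy as np

def se3_generators():
    """Return a (6, 4, 4) array: translations x, y, z then rotations x, y, z."""
    G = np.zeros((6, 4, 4))
    G[0, 0, 3] = G[1, 1, 3] = G[2, 2, 3] = 1.0   # translations
    G[3, 1, 2], G[3, 2, 1] = -1.0, 1.0           # rotation about x (assumed sign)
    G[4, 0, 2], G[4, 2, 0] = 1.0, -1.0           # rotation about y (assumed sign)
    G[5, 0, 1], G[5, 1, 0] = -1.0, 1.0           # rotation about z (assumed sign)
    return G

def matrix_exponential(A, terms=30):
    """Truncated power series for exp(A); adequate for small 4 x 4 matrices."""
    result = np.eye(A.shape[0])
    term = np.eye(A.shape[0])
    for k in range(1, terms):
        term = term @ A / k
        result = result + term
    return result

def motion_from_coefficients(alpha):
    """M = exp(alpha_i G_i), as in equation (7)."""
    return matrix_exponential(np.tensordot(alpha, se3_generators(), axes=1))

def linearised_motion(alpha):
    """M ~ I + alpha_i G_i, as in equation (8)."""
    return np.eye(4) + np.tensordot(alpha, se3_generators(), axes=1)

if __name__ == "__main__":
    alpha = np.array([0.01, -0.02, 0.03, 0.004, -0.005, 0.006])
    difference = motion_from_coefficients(alpha) - linearised_motion(alpha)
    print("largest deviation from the linear terms:", np.abs(difference).max())
```

For inter-frame motions of this size the deviation is second order in the coefficients, which is why the linear terms suffice within a single frame.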



Consequently, the motion is approximately a linear sum of that produced by 
each of the generators. The partial derivative of projective image co-ordinates 
with respect to the ith generating motion can be computed as: 

\[ \begin{pmatrix} u' \\ v' \\ w' \end{pmatrix} = P G_i \begin{pmatrix} x \\ y \\ z \\ 1 \end{pmatrix} \tag{9} \]

with 

\[ L_i = \begin{pmatrix} \tilde{u}' \\ \tilde{v}' \end{pmatrix} = \begin{pmatrix} \dfrac{u'}{w} - \dfrac{u w'}{w^2} \\[1ex] \dfrac{v'}{w} - \dfrac{v w'}{w^2} \end{pmatrix} \tag{10} \]

giving the motion in true image co-ordinates. A least-squares approach can then 
be used to fit the observed motion of image features between adjacent frames. 
This process is detailed in Section 2.1. 

Fig. 1. Computing the normal component of the motion and generator vector fields 

The features used in this work for tracking are the edges that are present 
in the model. These are strong features that can be reliably found in the image 
because they have a significant spatial extent. Furthermore, this means that 
a number of measurements can be made along each edge, and thus they may 
be accurately localised within an image. This choice also makes it possible to 
take advantage of the aperture problem (that the component of motion of an 
edge, tangent to itself, is not observable locally), since it allows the use of one- 
dimensional search along the edge normal (see Figure 1). The normal component 
of the motion fields, $L_i$, are then also computed (as $f_i = L_i \cdot \hat{n}$) and d can be 
fitted as a linear combination of the projections of the fi. 

In order to track the edges of the model as lines in the image, it is necessary 
to determine which (parts of) lines are visible at each frame and where they are 
located relative to the camera. This work uses binary space partition trees [?] 
to dynamically determine the visible features of the model in real-time. This 
technique allows accurate frame rate tracking of complex structures such as the 
ship part shown in Figure 2. As rendering takes place, the stencil buffer is used to 
locate the visible parts of each edge by querying the buffer at a series of points 
along the edge prior to drawing the edge. Where the line is visible, tracking 
nodes are assigned to search for the nearest intensity discontinuity in the video 
feed along the edge normal (see Figure 4). 

Fig. 2. Image and CAD model of ship part 

Figure 3 shows system operation. At each cycle, the system renders the expected 
view of the object (a) using its current estimate of the projection matrix, 
P. The visible edges are identified and tracking nodes are assigned at regular 
intervals in image co-ordinates along these edges (b). The edge normal is then 
searched in the video feed for a nearby edge (c). Typically m ≈ 400 nodes are 
assigned and measurements made in this way. The system then projects this 
m-dimensional measurement vector onto the 6-dimensional subspace corresponding 
to Euclidean transformations (d) giving the least squares estimate of the 
motion, M. The Euclidean part of the projection matrix, E, is then updated 
by right multiplication with this transformation (e). Finally, the new projection 
matrix P is obtained by multiplying the camera parameters K with the updated 
Euclidean matrix to give a new current estimate of the local position (f). The 
system then loops back to step (a). 



2.1 Computing the Motion 

Fig. 4. Tracking nodes are assigned and distances measured 

Step (d) in the process involves the projection of the measurement vector onto 
the subspace defined by the Euclidean transformation group. This subspace is 
given by the $f_i^\xi$ which describe the magnitude of the edge normal motion that 
would be observed in the image at the $\xi$th node for the $i$th group generator. 
These can be considered as a set of m-dimensional vectors which describe the 
motion in the image for each mode of Euclidean transformation. The system then 
projects the m-vector corresponding to the measured distances (to the observed 
edges) onto the subspace spanned by the transformation vectors. The geometric 
transformation of the part which best fits the observed edge positions can be 
found by minimising the square error between the transformed edge position and 
the actual edge position (in pixels). This process is performed as follows: 

\begin{align}
v_i &= \sum_\xi d^\xi f_i^\xi \tag{11} \\
C_{ij} &= \sum_\xi f_i^\xi f_j^\xi \tag{12} \\
\alpha_i &= C_{ij}^{-1} v_j \tag{13}
\end{align}

(with Einstein summation convention over Latin indices used throughout this 
paper). It can be seen that setting $\beta_i = \alpha_i$ gives the minimum (least-squares) 
solution to 

\[ S = \sum_\xi \big(d^\xi - \beta_i f_i^\xi\big)^2 \tag{14} \]

since 

\[ \frac{\partial S}{\partial \beta_i} = -2 \sum_\xi f_i^\xi \big(d^\xi - \beta_j f_j^\xi\big) \tag{15} \]

and setting $\beta_i = \alpha_i$ and substituting (13) gives 

\begin{align}
\frac{\partial S}{\partial \beta_i} &= -2 \sum_\xi f_i^\xi \Big(d^\xi - f_j^\xi C_{jk}^{-1} \sum_{\xi'} f_k^{\xi'} d^{\xi'}\Big) \tag{16} \\
&= -2 \sum_\xi \big(f_i^\xi d^\xi\big) + 2\, C_{ij} C_{jk}^{-1} \sum_{\xi'} f_k^{\xi'} d^{\xi'} \tag{17} \\
&= 0 \tag{18}
\end{align}
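
The projection in (11) to (13) amounts to solving the normal equations of a linear least-squares problem. The sketch below is an illustration under assumptions rather than the system's implementation: the array layout (one row of `f` per tracking node, one column per generator) and the function names are hypothetical, and random numbers stand in for real edge measurements.

```python
import numpy as np

def fit_motion(f, d):
    """Least-squares motion coefficients from edge-normal measurements.

    f : (m, 6) array, f[xi, i] is the normal motion at node xi for generator i
    d : (m,) array of measured distances to the observed edges
    """
    v = f.T @ d                    # equation (11)
    C = f.T @ f                    # equation (12)
    alpha = np.linalg.solve(C, v)  # equation (13), without forming C^-1
    return alpha, C, v

def sum_squared_error(f, d, beta):
    """S from equation (14) for an arbitrary motion beta."""
    residual = d - f @ beta
    return float(residual @ residual)

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    f = rng.normal(size=(400, 6))          # synthetic stand-in for the f_i^xi
    d = f @ rng.normal(scale=0.1, size=6)  # noiseless synthetic measurements
    alpha, C, v = fit_motion(f, d)
    print("gradient at alpha:", np.abs(f.T @ (d - f @ alpha)).max())
```

Solving the 6 × 6 system directly avoids an explicit matrix inverse; the printed gradient should be zero to rounding error, which is the content of (16) to (18).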
Fig. 5. Frames from video of tracking sequence: The CAD model of the ship part is 
superimposed on the video image using the estimate of the projection matrix. 



The $\alpha_i$ thus define a linear approximation to the Euclidean motion which 
minimises the sum squared error between the model and the observed lines. When 
more complex configurations are examined, it becomes important to consider 
how the sum squared error varies when $\beta_i \neq \alpha_i$. Setting $\beta_i = \alpha_i + \varepsilon_i$, (15) gives 

\begin{align}
\frac{\partial S}{\partial \varepsilon_i} &= 0 + 2 \sum_\xi f_i^\xi f_j^\xi \varepsilon_j \tag{19} \\
&= 2\, C_{ij} \varepsilon_j \tag{20}
\end{align}

and integrating gives 

\[ S = S_0 + \varepsilon_i C_{ij} \varepsilon_j \quad \text{where } S_0 = S|_{\beta = \alpha} \tag{21} \]

All that remains for the rigid body tracker is to compute the matrix for the 
motion of the model represented by the $\alpha_i$ and apply it to the matrix E in (2), 
which is done by using the exponential map. 

\[ E_{t+1} = E_t \exp\Big(\sum_i \alpha_i G_i\Big) \tag{22} \]

The system is therefore able to maintain an estimate of E (and hence P) 
by continually computing the coefficients $\alpha_i$ of inter-frame motions (see Figure 
5). This method has also been extended to include the motion of image features 
due to the change in internal camera parameters and thus provide a method for 
on-line camera calibration [23]. In practice the simple least squares algorithm is 
not robust to outliers so the terms in (11) and (12) are reweighted by a decaying 
function of $d^\xi$ to obtain a robust M-estimator. The reweighting causes the 
algorithm to become iterative (since $d^\xi$ varies with each iteration) but convergence 
in all but extreme conditions is very fast and only one iteration is used per video 
frame/field. 
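
A single reweighted iteration of the kind described above might look as follows. The text says only that the terms are reweighted by a decaying function of the measured distance; the particular weight 1/(c + |d|), the constant `c` and the function name are assumptions made for illustration.

```python
import numpy as np

def reweighted_fit(f, d, c=1.0):
    """One iteration of reweighted least squares on the normal equations.

    f : (m, 6) array of edge-normal generator motions, one row per node
    d : (m,) array of measured distances
    c : assumed constant controlling how quickly large distances are discounted
    """
    w = 1.0 / (c + np.abs(d))      # assumed decaying weight function
    v = f.T @ (w * d)              # weighted form of equation (11)
    C = f.T @ (w[:, None] * f)     # weighted form of equation (12)
    return np.linalg.solve(C, v)   # equation (13)

if __name__ == "__main__":
    rng = np.random.default_rng(2)
    f = rng.normal(size=(400, 6))
    alpha_true = np.array([0.1, -0.2, 0.05, 0.01, -0.02, 0.03])
    d = f @ alpha_true
    d[:40] += 20.0                 # simulate 10% gross outliers
    plain = np.linalg.solve(f.T @ f, f.T @ d)
    robust = reweighted_fit(f, d)
    print("plain error :", np.linalg.norm(plain - alpha_true))
    print("robust error:", np.linalg.norm(robust - alpha_true))
```

Because distant measurements receive small weights, the gross outliers contribute far less to the estimate than they do in the unweighted fit.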

3 Complex Configurations 

The rigid body tracking system presented in the previous section is now used 
as the basis of an approach which is designed to operate in more complex 
configurations. A novel framework for constructing tracking systems within these 
configurations is now presented, which takes advantage of the formulation and 
computational operation of the rigid body tracker. Such configurations arise in 
a number of ways: 

Multiple cameras: It is often desirable to use more than one camera to obtain 
information about a scene since multiple view configurations can provide 
higher pose precision (especially when a large baseline is used) and also 
increase the robustness of the tracker. 

Multiple targets: There are many situations in which knowing the relations- 
hip between the camera and a single target is insufficient. This occurs par- 
ticularly when the position of the camera is not of direct interest. In these 
situations, it is often desirable to measure the relationship between two or 
more targets that are present in the scene, for example between two vehicles 
and the road, or between a robot tool and its workpiece. 

Articulated targets: Many targets of interest are not simple rigid bodies, but 
contain internal degrees of freedom. This work is restricted to considering 
targets which comprise a number of rigid components connected by hinges 
or slides etc. 

The simplest way to handle these configurations is merely to run multiple 
instances of the rigid body tracker concurrently, one per component per camera. 
Thus, for example three cameras viewing two structures would require six con- 
current trackers. Unfortunately, this naive approach can introduce many more 
degrees of freedom into the system than are really present. In this example, even 
if the cameras and structures can move independently, there are only 24 degrees 
of freedom in the world, whereas the system of six trackers contains 36. In ge- 
neral, this is a bad thing since problems such as ill-conditioning and high search 
complexity are more prevalent in high dimensional systems and also because the 
solution thus generated can exhibit inconsistencies. The natural approach to this 
problem is to impose all of the constraints that are known about the world upon 
the tracking system. 



4 Applying Constraints 

Multiple Cameras: In the case in which multiple cameras are used to view 
a scene, it may be that the cameras are known to be rigid relative to one 
another in space. In this case, there are 6 constraints that can be imposed 
on the system for every camera additional to the first. 

Multiple structures: Where the system is being used to track multiple structures, 
it is often the case that other constraints apply between the structures. 
For example two cars will share a common ground-plane, and thus a system 
in which two vehicles are observed from an airborne camera will have three 
constraints that apply to the raw twelve dimensions present in the two trackers, 
reflecting the nine degrees of freedom present in the world. 
Articulated structures: This is really a special case of constrained multiple 
structures, except that there are usually more constraints. A hinged structure, 
for example, has seven degrees of freedom (six for position in the world 
and one for the angle of the hinge). When the two components of the structure 
are independently tracked, there are five hinge constraints which apply 
to the system. 

Because these constraints exist in the world, it is highly desirable to impose 
them on the system of trackers. Each of the trackers generates an estimate for 
the motion of one rigid component in a given view, $\alpha_i$ in (13), as well as a matrix 
$C_{ij}$ in (21) which describes how the error varies around that estimate. Thus the 
goal is to use both of these pieces of information from each tracker to obtain 
a global maximum a-posteriori estimate of the motion subject to satisfying the 
known constraints. This raises three issues which must be addressed: 

1. Measurements from different trackers are made in different co-ordinate frames. 

2. How can the constraints be expressed? 

3. How can they then be imposed? 

4.1 Co-ordinate Frames 

The first difficulty is that the $\alpha_i$ and the $C_{ij}$ are quantities in the Lie algebra 
deriving from the co-ordinate frame of the object being tracked. Since these 
are not the same, in general, for distinct trackers, a method for transforming 
the $\alpha_i$ and $C_{ij}$ from one co-ordinate frame to another is needed. Specifically, 
this requires knowing what happens to the Lie algebra of SE(3) under 
co-ordinate frame changes. Since these frame changes correspond to elements of the 
Lie group SE(3), this reduces to knowing what happens to the Lie algebra of 
the group under conjugation by elements of the group. This is (by definition) 
the adjoint representation of the group which is an n × n matrix representation, 
where n is the dimensionality of the group (six in the case of SE(3)). The adjoint 
representation, ad(M), for a matrix element of SE(3), M, can easily be computed 
by considering the action of M on the group generators, $G_i$, by conjugation: 

\[ M G_i M^{-1} = \sum_j \mathrm{ad}(M)_{ij}\, G_j \tag{23} \]

If (with a slight abuse of notation) $M = [\, R \mid t \,]$, this is given by 

\[ \mathrm{ad}(M) = \begin{bmatrix} R & 0 \\ [t_\wedge] R & R \end{bmatrix} \quad \text{where } [t_\wedge]_{ij} = \epsilon_{ijk} t_k \tag{24} \]

To see that these 6 × 6 matrices do form a representation of SE(3), it is only 
necessary to ensure that multiplication is preserved under the mapping into the 
adjoint space (that $\mathrm{ad}(M_1)\,\mathrm{ad}(M_2) = \mathrm{ad}(M_1 M_2)$) which can easily be checked 
using the identity $R_1 [t_{2\wedge}] R_1^T = [(R_1 t_2)_\wedge]$. Thus if M transforms points from 
co-ordinate frame 1 into frame 2, then ad(M) transforms a vector in the Lie algebra 
of frame 1 into the Lie algebra of frame 2. Using this, the quantities in equations 
(11) - (21) can be transformed as follows (see Figure 6(a-b)): 

\begin{align}
\alpha' &= \mathrm{ad}(M)\, \alpha \tag{25} \\
C' &= \mathrm{ad}(M)^{-T}\, C\, \mathrm{ad}(M)^{-1} \tag{26} \\
v' &= \mathrm{ad}(M)^{-T}\, v \tag{27}
\end{align}



4.2 Expressing Constraints 

It is useful to have a generic method for expressing the constraints that are 
present on the given world configuration since this increases the speed with which 
models for new situations may be constructed. In the Lie algebra formalism, it 
is very easy to express the constraints that describe a hinge, a slide or the 
existence of a common ground plane since the relationship between the motion 
in the algebra and the constraints is a simple one. 

The presence of a hinge or common ground plane are holonomic constraints 
which reduce the dimensionality of the configuration space by five and three 
respectively. This results in a seven or nine dimensional sub-manifold representing 
legal configurations embedded within the raw twelve dimensional configuration 
manifold. The tangent space to this submanifold corresponds to the space of 
velocities which respect the constraint. This means that at each legal configuration 
there is a linear subspace of legal velocities, which implies that the constraints on 
the velocities must be both linear and homogeneous (since zero velocity results 
in a legal configuration). Thus if $\beta_1$ and $\beta_2$ correspond to the motions of the two 
rigid components (in their Lie algebras) then the constraints must take the form 

\[ \beta_1 \cdot c_1 + \beta_2 \cdot c_2 = 0 \tag{28} \]

There must be five such $c_1$ and $c_2$ for the hinge or three for the common 
ground plane. As a simple example, consider the case of a hinge in which the 
axis of rotation passes through the origin of component 1’s co-ordinate frame 
and lies along its z axis. When the motions of the two parts are considered in 
1’s frame, then their translations along all three axes must be the same as must 
their rotations about the x and y axes; only their rotations about the z axis 
can differ. Since component 2’s motion can be transformed into 1’s co-ordinate 
frame using the adjoint representation of the co-ordinate transformation, the 
constraints now take the form 

\[ \beta_1 \cdot c_1 + \beta_2' \cdot c_2 = 0 \tag{29} \]
where $\beta_2' = \mathrm{ad}(E_1^{-1} E_2)\, \beta_2$ is the motion of component 2 in 1’s frame. In this 
example, the $c_1$ and $c_2$ vectors for the five constraints become particularly simple: 

\[ c_1^1 = \begin{pmatrix} 1\\0\\0\\0\\0\\0 \end{pmatrix},\;
   c_1^2 = \begin{pmatrix} 0\\1\\0\\0\\0\\0 \end{pmatrix},\;
   c_1^3 = \begin{pmatrix} 0\\0\\1\\0\\0\\0 \end{pmatrix},\;
   c_1^4 = \begin{pmatrix} 0\\0\\0\\1\\0\\0 \end{pmatrix},\;
   c_1^5 = \begin{pmatrix} 0\\0\\0\\0\\1\\0 \end{pmatrix} \tag{30} \]

with $c_2^i = -c_1^i$. In the case of a common ground plane in 1’s x-y plane, only 
constraints 3, 4 and 5 are needed. If the hinge or ground plane are placed 
elsewhere then the adjoint representation can be used to transform the constraints 
by considering a Euclidean transformation that takes this situation back to the 
simple one. 
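
The adjoint machinery of section 4.1 can be exercised numerically. The sketch below assembles the 6 × 6 matrix from a rotation R and translation t using the block layout and the definition of the bracket operator printed in (24), checks that multiplication is preserved, and maps an estimate and its error matrix into another frame so that the quadratic error term is unchanged. The helper names are hypothetical, and the rule used for the error matrix (inverse transpose on the left, inverse on the right) is the one that keeps the quadratic form invariant, which is an assumption about the intended reading of (26).

```python
import numpy as np

def hat(t):
    """[t^]_ij = eps_ijk t_k, written out explicitly."""
    return np.array([[0.0, t[2], -t[1]],
                     [-t[2], 0.0, t[0]],
                     [t[1], -t[0], 0.0]])

def adjoint(R, t):
    """6 x 6 adjoint matrix with the block layout of equation (24)."""
    ad = np.zeros((6, 6))
    ad[:3, :3] = R
    ad[3:, 3:] = R
    ad[3:, :3] = hat(t) @ R
    return ad

def transform_estimate(ad, alpha, C):
    """Map a motion estimate and its error matrix into the new frame."""
    ad_inv = np.linalg.inv(ad)
    return ad @ alpha, ad_inv.T @ C @ ad_inv

def rot_x(a):
    c, s = np.cos(a), np.sin(a)
    return np.array([[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]])

def rot_z(a):
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

if __name__ == "__main__":
    R1, t1 = rot_z(0.3) @ rot_x(-0.7), np.array([0.1, -0.2, 0.3])
    R2, t2 = rot_x(1.1) @ rot_z(0.4), np.array([-0.5, 0.4, 0.2])
    lhs = adjoint(R1, t1) @ adjoint(R2, t2)
    rhs = adjoint(R1 @ R2, R1 @ t2 + t1)   # adjoint of the product M1 M2
    print("representation property holds:", np.allclose(lhs, rhs))
```

The composition check relies only on the rotation identity quoted in the text, so it holds for either sign convention of the bracket operator.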



4.3 Imposing the Constraints 

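Since the constraints have a particularly simple form, finding the optimal $\beta_1$ 
and $\beta_2'$ is also an easy matter. This is done by modifying the least-squares fitting 
procedure used for the single tracker, which is adapted so that the motion which 
gives the least square error subject to satisfying the constraints is found. Given 
the $\alpha$ and $C$ computed in (11) - (13), then (21) gives the increase in sum squared 
error if the motion $\beta$ is used in place of $\alpha$ as $(\beta - \alpha)^T C (\beta - \alpha)$. Thus, given the 
independent solutions for the two motions $(\alpha_1, C_1)$ and $(\alpha_2', C_2')$, the aim is to 
find $\beta_1$ and $\beta_2'$ such that 

\[ (\beta_1 - \alpha_1)^T C_1 (\beta_1 - \alpha_1) + (\beta_2' - \alpha_2')^T C_2' (\beta_2' - \alpha_2') \tag{31} \]

is minimised subject to 

\[ \beta_1 \cdot c_1^i + \beta_2' \cdot c_2^i = 0 \tag{32} \]

This is a constrained optimisation problem and ideal for solving by means of 
Lagrange multipliers. Thus the solution is given by the constraints in (32) and 

\[ \nabla \big( (\beta_1 - \alpha_1)^T C_1 (\beta_1 - \alpha_1) + (\beta_2' - \alpha_2')^T C_2' (\beta_2' - \alpha_2') \big) + \lambda_i \nabla \big( \beta_1^T c_1^i + \beta_2'^T c_2^i \big) = 0 \tag{33} \]

with $\nabla$ running over the twelve dimensions of $(\beta_1, \beta_2')$. This evaluates to 

\[ \begin{pmatrix} 2 C_1 (\beta_1 - \alpha_1) \\ 2 C_2' (\beta_2' - \alpha_2') \end{pmatrix} + \lambda_i \begin{pmatrix} c_1^i \\ c_2^i \end{pmatrix} = 0 \tag{34} \]

Thus 

\[ \beta_1 = \alpha_1 - \tfrac{1}{2} C_1^{-1} \lambda_i c_1^i \quad \text{and} \quad \beta_2' = \alpha_2' - \tfrac{1}{2} C_2'^{-1} \lambda_i c_2^i \tag{35} \]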
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Fig. 6. Applying the constraints: Estimates and errors are computed for motions 1 and 
2 (a), the estimate and error of motion 2 are mapped into I’s co-ordinate frame (b), 
the constraint is applied there (c) and then the new estimate of motion 2 is mapped 
back into its own frame (d). 



Substituting (35) back into (32) gives 

$c_1^i \cdot \alpha_1 + c_2^i \cdot \alpha'_2 - \frac{1}{2} \sum_j \lambda_j \left(c_1^i \cdot C_1^{-1} c_1^j + c_2^i \cdot C'^{-1}_2 c_2^j\right) = 0$    (36)

So the $\lambda_i$ are given by 

$A_{ij} = c_1^i \cdot C_1^{-1} c_1^j + c_2^i \cdot C'^{-1}_2 c_2^j$    (37)

$l_i = 2 \left(c_1^i \cdot \alpha_1 + c_2^i \cdot \alpha'_2\right)$    (38)

$\lambda_i = \sum_j (A^{-1})_{ij}\, l_j$    (39)

The $\lambda_i$ can then be substituted back into (35) to obtain $\beta_1$ and $\beta'_2$ (see Figure 
6(c)), from which $\beta_2$ can also be obtained by $\beta_2 = \mathrm{ad}(E_2^{-1} E_1)\,\beta'_2$ (see Figure 
6(d)). The $\beta$ can then be used to update the configurations of the two rigid 
parts of the hinged structure giving the configuration with the least square error 
that also satisfies the constraints. 
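
The constrained fit of Section 4.3 amounts to one small linear solve. The following is a minimal NumPy sketch of equations (35) and (37)-(39), not the authors' implementation. The function name, the array layout (one constraint vector per row) and the assumption that motion 2 has already been mapped into frame 1 are illustrative choices.

```python
import numpy as np

def constrain_motions(alpha1, C1, alpha2, C2, c1, c2):
    """Minimise (b1-a1)^T C1 (b1-a1) + (b2-a2)^T C2 (b2-a2)
    subject to b1 . c1[i] + b2 . c2[i] = 0 for every constraint i.

    alpha1, alpha2 : (6,) unconstrained estimates (motion 2 expressed in frame 1)
    C1, C2         : (6, 6) symmetric positive-definite matrices
    c1, c2         : (k, 6) arrays, one constraint vector per row (assumed layout)
    """
    C1_inv_c1 = np.linalg.solve(C1, c1.T)      # columns are C1^{-1} c1^j
    C2_inv_c2 = np.linalg.solve(C2, c2.T)      # columns are C2^{-1} c2^j
    A = c1 @ C1_inv_c1 + c2 @ C2_inv_c2        # equation (37)
    l = 2.0 * (c1 @ alpha1 + c2 @ alpha2)      # equation (38)
    lam = np.linalg.solve(A, l)                # equation (39)
    beta1 = alpha1 - 0.5 * C1_inv_c1 @ lam     # equation (35)
    beta2 = alpha2 - 0.5 * C2_inv_c2 @ lam
    return beta1, beta2
```

For a hinge along the z axis, and assuming (hypothetically) that the six motion components are ordered so that the z rotation comes last, `c1` would be the first five rows of the 6x6 identity and `c2 = -c1`. With identity error matrices the five shared components then come out as the average of the two unconstrained estimates.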

5 Results 

A system was developed to test the tracking of a simple articulated structure 
(shown in Figure 7(a)). This system operates in real-time at PAL frame-rate 
(25Hz) on an SGI O2 (225 MHz R10K). The structure consists of two 
components, each 15cm square, joined along one edge by a hinge. This structure 
is a difficult one to track since there are barely enough degrees of freedom in 
the image of the structure to constrain the parameters of the model. A series 
of experiments was conducted to examine the precision with which the system 
can estimate the angle between parts of the model with and without the hinge 
constraints imposed. The hinge of the part was oriented at a series of known 
angles and for each angle a set of measurements was taken with and without 
the constraints imposed. The amount by which the rotational and translational 
constraints (measured at the hinge) are violated by the unconstrained tracker 
was also measured. 

In all cases, the estimate produced by the constrained tracker was within 1° 
of the ground truth. The unconstrained (12 DoF) tracker was much less 
accurate in general, and also reported substantial errors in violation of the known 
constraints. The variance in the angle estimate gives an indication of the 
stability of the tracker and it can be seen that the use of constraints improves this 
significantly. Figure 7(b) shows the behaviour of the unconstrained tracker. 
Because of the difficulty in finding the central crease, this tracker becomes weakly 
conditioned and noise fitting can introduce large errors. 

This system was then extended to track the structure with an additional 
square component and hinge (see Figure 8(a)). The system is able to track the 
full configuration of the structure, even when the central component is fully 
hidden from view (see Figure 8(b)). In this case, the observed positions of the 
two visible components are sufficient to determine the location of the hidden 
part. Further, the indirect constraints between the two end parts of the structure 
serve to improve the conditioning of the estimation of their positions. 

A system was also developed to show that constraints of intermediate 
complexity such as the existence of a common ground plane can be implemented 
within this framework. The system can dynamically impose or relax the 
common ground plane constraint. The objects to be tracked are shown in Figure 
9(a) and Figure 9(b) shows how the tracker behaves when the constraint is 
deliberately violated; the output of the system still respects the constraint and is 
forced to find a compromise between the two components. 

Fig. 7. Hinge tracking with and without constraints 

Fig. 8. Double hinge structure: The tracker can infer the position of a hidden 
component from the constraints 

Finally, a multi-camera system was developed using three cameras 
multiplexed using the red, green and blue components of a 4:2:2 digital signal to track 
the pose of a rigid structure (the ship part). With 3 cameras operating 
simultaneously (on a complex structure) the achieved frame rate dropped to 20Hz (this 
is believed to be due to speed limitations of the GL rendering hardware used in 
the tracking cycle). This 3 camera configuration is found to be much more stable 
and robust, maintaining a track over sequences that have been found to cause 
the single camera tracker to fall into a local minimum. These instabilities occur 
in a sparse set of configurations (e.g. when a feature rich plane passes through 
the camera and also in near-affine conditions when such a plane is fronto-parallel 
to the camera). By employing multiple cameras it becomes extremely difficult 
to contrive a situation that is critical in all camera views simultaneously. 

Fig. 9. Two structures with common ground plane constraint: When the world violates 
the constraint, the tracker attempts to fit the constrained model. In this example, the 
tracker has fitted some parts of both models 

6 Conclusion 

The use of Lie algebras for representing differential quantities within a rigid 
body tracker has facilitated the construction of systems which operate in more 
complex and constrained configurations. Within this representation, it is easy to 
transform rigid body tracking information between co-ordinate frames using the 
adjoint representation, and also to express and impose the constraints 
corresponding to the presence of hinges or a common ground plane. This yields benefits in 
terms of ease of programming and implementation, which in turn make it readily 
possible to achieve real-time frame rate performance using standard hardware. 

Abstract. This paper presents a prototype system for pedestrian 
detection on-board a moving vehicle. The system uses a generic two-step 
approach for efficient object detection. In the first step, contour features are 
used in a hierarchical template matching approach to efficiently "lock" 
onto candidate solutions. Shape matching is based on Distance 
Transforms. By capturing the object's shape variability by means of a template 
hierarchy and using a combined coarse-to-fine approach in shape and 
parameter space, this method achieves very large speed-ups compared to 
a brute-force method. We have measured gains of several orders of 
magnitude. The second step utilizes the richer set of intensity features in 
a pattern classification approach to verify the candidate solutions (i.e. 
using Radial Basis Functions). We present experimental results on 
pedestrian detection off-line and on-board our Urban Traffic Assistant vehicle 
and discuss the challenges that lie ahead. 



1 Introduction 

We are developing vision-based systems for driver assistance on-board vehicles 
[?]. Safety and ease-of-use of vehicles are the two central themes in this line of 
work. This paper focusses on the safety aspect and presents a prototype system 
for the detection of the most vulnerable traffic participants: pedestrians. To 
illustrate the magnitude of the problem, consider the numbers for Germany: 
more than 40,000 pedestrians were injured in 1996 alone due to collisions with 
vehicles [?]. Of these, more than 1000 were fatal injuries. Our long-term goal is to 
develop systems which, if they cannot avoid these accidents altogether, at least minimize 
their severity by employing protective measures in case of impending collisions. 

An extensive amount of computer vision work exists in the area of 
"Looking-at-People", see [?] for a recent survey. The pedestrian application on-board 
vehicles is particularly difficult for a number of reasons. The objects of interest 
appear in highly cluttered backgrounds and have a wide range of appearances, 
due to body size and poses, clothing and outdoor lighting conditions. They typically 
stand relatively far away from the camera, and thus appear rather small in 
the image, at low resolution. A major complication is that because of the moving 
vehicle, one does not have the luxury of using simple background subtraction 
methods to obtain a foreground region containing the human. Furthermore, there 
are hard real-time requirements for the vehicle application which rule out any 
brute-force approaches. 



D. Vernon (Ed.): ECCV 2000, LNCS 1843, pp. .37-Eol. 2000. 
© Springer-Verlag Berlin Heidelberg 2000 






D.M. Gavrila 



The outline of this paper is as follows. After reviewing past work on 
pedestrian detection in Section 2, we present an efficient two-step approach to this 
problem. The Chamfer System, a system for shape-based object detection based 
on multi-feature hierarchical template matching, is described in Section 3. The 
following Section 4 deals with a Radial Basis Function (RBF)-based verification 
method employed to dismiss false positives. Special measures are taken to obtain 
a "high-quality" training set. Section 5 lists the experiments on pedestrian 
detection; it is followed by a discussion of the challenges that lie ahead, in Section 
6. We conclude in Section 7. 



2 Previous Work 

Most work on pedestrian detection has taken a learning-based approach, 
bypassing a pose recovery step altogether and describing human appearance in 
terms of simple low-level features from a region of interest. One line of work has 
dealt specifically with scenes involving people walking laterally to the viewing 
direction. Periodicity has provided a quite powerful cue for this task, either 
derived from optical flow [?] or raw pixel data [?]. Heisele and Wöhler [?] 
describe ways to learn the characteristic gait pattern using a Time-Delay Neural 
Network with local receptive fields; their method is not based on periodicity 
detection and extends to arbitrary motion patterns. 

A crucial factor determining the success of the previous learning methods is 
the availability of a good foreground region. Standard background subtraction 
techniques are of little avail because of a moving camera; here, independent 
motion detection techniques can help [?], although they are themselves difficult 
to develop. Yet, given a correct initial foreground region, some of the burden 
can be shifted to tracking. For example, work by Baumberg and Hogg [2] 
applied Active Shape Models, based on B-splines, for tracking pedestrians. The 
interesting feature of this approach is that the Active Shape Models only deform 
in a way consistent with the training set; they can be combined with scale-space 
matching techniques to increase their coverage in image space [2]. In other work 
[?], color clusters are tracked over time; a pre-selection technique is used to 
identify the clusters that might correspond to the legs. Work by Curio et al. 
[?] uses a general-purpose tracker based on the Hausdorff distance to track the 
edges of the legs. Rigoll, Winterstein and Müller [?] perform Kalman filtering 
on a HMM-based representation of pedestrians. 

A complementary problem is to detect pedestrians whilst they stand still. 
A system that can detect pedestrians in static images is described in [?]. It 
basically shifts windows of various sizes over the image, extracts an overcomplete 
set of wavelet features from the current window, and applies a Support Vector 
Machine (SVM) classifier to determine whether a pedestrian is present or not. 

The proposed system is, like [?], applied to pedestrian detection in static 
images. However, the brute-force window sliding technique used there is not 
feasible for real-time vision on-board vehicles, because of the large computational 
cost involved. We propose a shape-based system that does not require a region 
of interest, yet can very quickly "lock" onto desired objects, using an efficient 
coarse-to-fine technique based on distance transforms. The pattern classification 
approach is only applied at the second stage, for verification, allowing real-time 
performance. The resulting system is generic and can be applied to other object 
recognition tasks as well. 

3 Detection: The Chamfer System 

We now discuss the basics and extensions of the Chamfer System, a system for 
realtime shape-based object detection. 

3.1 Basics 

At the core of the proposed system lies shape matching using distance transforms 
(DT) [?]. Consider the problem of detecting pedestrians in an image (Figure 
1a). Various object appearances are modeled with templates such as in Figure 
1b. Matching template T and image I involves computing the feature image of 
I (Figure 1c) and applying a distance transform to obtain a DT-image (Figure 
1d). 

A distance transform converts a binary image, which consists of feature and 
non-feature pixels, into an image where each pixel value denotes the distance to 
the nearest feature pixel. A variety of DT algorithms exist, differing in their use 
of a particular distance metric and the way local distances are propagated. The 
chamfer transform, for example, computes an approximation of the Euclidean 
distance using integer arithmetic, typically in raster-scan fashion [?]. 
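
The raster-scan propagation described above can be sketched as a two-pass chamfer transform. This is an illustrative version, not the paper's implementation: the local weights 3 (axial step) and 4 (diagonal step) are a commonly used choice assumed here, since the text does not state which mask was used, and the function name is hypothetical.

```python
import numpy as np

def chamfer_distance_transform(features, a=3, b=4):
    """Two-pass integer chamfer transform of a boolean feature image.

    features : 2-D boolean array, True at feature (e.g. edge) pixels
    a, b     : assumed local costs of an axial and a diagonal step
    Returns an integer array; dividing by `a` approximates Euclidean distance.
    """
    h, w = features.shape
    big = a * (h + w) + 1                      # exceeds any reachable distance
    dt = np.where(features, 0, big).astype(np.int64)
    # Forward pass: propagate from the left and upper neighbours.
    for y in range(h):
        for x in range(w):
            d = dt[y, x]
            if x > 0:
                d = min(d, dt[y, x - 1] + a)
            if y > 0:
                d = min(d, dt[y - 1, x] + a)
                if x > 0:
                    d = min(d, dt[y - 1, x - 1] + b)
                if x < w - 1:
                    d = min(d, dt[y - 1, x + 1] + b)
            dt[y, x] = d
    # Backward pass: propagate from the right and lower neighbours.
    for y in range(h - 1, -1, -1):
        for x in range(w - 1, -1, -1):
            d = dt[y, x]
            if x < w - 1:
                d = min(d, dt[y, x + 1] + a)
            if y < h - 1:
                d = min(d, dt[y + 1, x] + a)
                if x < w - 1:
                    d = min(d, dt[y + 1, x + 1] + b)
                if x > 0:
                    d = min(d, dt[y + 1, x - 1] + b)
            dt[y, x] = d
    return dt
```

Each pixel is visited only twice regardless of how many feature pixels there are, which is what makes the transform cheap enough to compute once per image before any template is matched.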

After computing the distance transform, the relevant template T is 
transformed (e.g. translated) and positioned over the resulting DT image of I; the 
matching measure D(T, I) is determined by the pixel values of the DT image 
which lie under the "on" pixels of the transformed template. These pixel values 
form a distribution of distances of the template features to the nearest features 
in the image. The lower these distances are, the better the match between image 
and template at this location. There are a number of matching measures that can 
be defined on the distance distribution; one possibility is to use simple averaging. 
Other more robust (and costly) measures reduce the effect of missing features 
(e.g. due to occlusion or segmentation errors) by using the average truncated 
distance or the f-th quantile value (the Hausdorff distance), e.g. [?]. 

For efficiency purposes, we use in our work the average chamfer distance 

$D_{chamfer}(T, I) = \frac{1}{|T|} \sum_{t \in T} d_I(t)$    (1)

where $|T|$ denotes the number of features in T and $d_I(t)$ denotes the chamfer 
distance between feature t in T and the closest feature in I. 

In applications, a template is considered matched at locations where the 
distance measure D(T, I) is below a user-supplied threshold $\theta$ 

$D(T, I) < \theta$    (2)

Fig. 1. (a) original image (b) template (c) edge image (d) DT image 



The advantage of matching a template with the DT image rather than with 
the edge image is that the resulting similarity measure will be smoother as a 
function of the template transformation parameters. This enables the use of an 
efficient search algorithm to lock onto the correct solution, as will be discussed 
shortly. It also allows some degree of dissimilarity between a template and an 
object of interest in the image. 
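
The matching measure of this section reduces to a table lookup in the DT image. The sketch below is a minimal illustration under stated assumptions: the template is given as integer (row, column) coordinates of its "on" pixels, only translation is considered, placements that leave the image are rejected, and `brute_force_dt` is a hypothetical exact Euclidean stand-in for the chamfer transform so that the example is self-contained.

```python
import numpy as np

def brute_force_dt(features):
    """Exact Euclidean distance to the nearest feature pixel (stand-in DT)."""
    ys, xs = np.nonzero(features)
    gy, gx = np.indices(features.shape)
    d = np.sqrt((gy[..., None] - ys) ** 2 + (gx[..., None] - xs) ** 2)
    return d.min(axis=-1)

def average_chamfer_distance(dt_image, template_points, offset):
    """Mean DT value under the translated template's 'on' pixels (Equation 1)."""
    pts = np.asarray(template_points) + np.asarray(offset)
    rows, cols = pts[:, 0], pts[:, 1]
    inside = ((rows >= 0) & (rows < dt_image.shape[0]) &
              (cols >= 0) & (cols < dt_image.shape[1]))
    if not inside.all():
        return float("inf")        # assumption: reject out-of-image placements
    return float(dt_image[rows, cols].mean())

def match_locations(dt_image, template_points, offsets, theta):
    """Offsets at which the measure falls below the threshold (Equation 2)."""
    return [o for o in offsets
            if average_chamfer_distance(dt_image, template_points, o) < theta]
```

Because the DT image is computed once, evaluating a template at a new location costs only |T| lookups, and a one-pixel shift changes the score gradually rather than abruptly, which is the smoothness the text refers to.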



3.2 Extensions 

The main contribution of the Chamfer System is the use of a template hierarchy 
to efficiently match whole sets of templates. These templates can be geometrical 
transformations of a reference template, or, more generally, be examples capturing 
the set of appearances of an object of interest (e.g. pedestrian). The underlying 
idea is to derive a representation off-line which exploits any structure in this 
template distribution, so that, on-line, matching can proceed efficiently. More 
specifically, the aim is to group similar templates together and represent them 
by two entities: a "prototype" template and a distance parameter. The latter needs 
to capture the dissimilarity between the prototype template and the templates 
it represents. By matching the prototype with the images, rather than the 
individual templates, a typically significant speed-up can be achieved on-line. When 
applied recursively, this grouping leads to a template hierarchy, see Figure 2. 

Fig. 2. A hierarchy for pedestrian shapes (partial view) 



The above ideas are put into practice as follows. Offline, a template hierarchy 
is generated automatically from available example templates. The proposed 
algorithm uses a bottom-up approach and applies a partitional clustering algorithm 
at each level of the hierarchy. The input to the algorithm is a set of templates 
$t_1, ..., t_N$, their dissimilarity matrix (see below) and the desired partition size K. 
The output is the K-partition and the prototype templates $p_1, ..., p_K$ for each 
of the K groups $S_1, ..., S_K$. The K-way clustering is achieved by iterative 
optimization. Starting with an initial (random) partition, templates are moved back 
and forth between groups while the following objective function E is minimized 

$E = \sum_{k=1}^{K} \max_{t_i \in S_k} D(t_i, p_k^*)$    (3)

Here, $D(t_i, p_k^*)$ denotes the distance measure between the i-th element of group 
k and the prototype for that group at the current iteration, $p_k^*$. The distance 
measure is the same as the one used for matching (e.g. chamfer or Hausdorff 
distance). Entry D(i,j) is the ij-th member of the dissimilarity matrix, which 
can be computed fully before grouping or only on demand. 

One way of choosing the prototype $p_k^*$ is to select the template with the 
smallest maximum distance to the other templates. A low E-value is desirable 
since it implies a tight grouping; this lowers the distance threshold that will be 
required during matching (see also Equation 5) which in turn likely decreases 
the number of locations which one needs to consider during matching. Simulated 
annealing is used to perform the minimization of E. 
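
The grouping step can be sketched on a precomputed dissimilarity matrix. The code below is only an illustration of the idea: it assumes the objective is the sum over groups of the largest member-to-prototype distance, picks each prototype by the smallest-maximum-distance rule mentioned above, and uses a plain linear cooling schedule. The schedule, the single-template move and all names are assumptions, not details taken from the paper.

```python
import numpy as np

def prototypes_and_cost(D, labels, K):
    """Pick a prototype per group (smallest maximum distance to the other
    members) and return (prototype indices, objective E)."""
    protos, E = [], 0.0
    for k in range(K):
        members = np.flatnonzero(labels == k)
        if members.size == 0:
            protos.append(-1)                  # empty group: no prototype
            continue
        sub = D[np.ix_(members, members)]
        worst = sub.max(axis=1)                # worst distance per candidate
        best = int(worst.argmin())
        protos.append(int(members[best]))
        E += float(worst[best])
    return protos, E

def anneal_partition(D, K, steps=2000, t0=1.0, seed=0):
    """Simulated annealing over K-partitions; returns the best one seen."""
    rng = np.random.default_rng(seed)
    n = D.shape[0]
    labels = rng.integers(K, size=n)           # initial random partition
    _, cost = prototypes_and_cost(D, labels, K)
    best_labels, best_cost = labels.copy(), cost
    for step in range(steps):
        temp = t0 * (1.0 - step / steps) + 1e-9    # assumed linear cooling
        i, k = rng.integers(n), rng.integers(K)
        if labels[i] == k:
            continue
        trial = labels.copy()
        trial[i] = k                           # move one template to group k
        _, trial_cost = prototypes_and_cost(D, trial, K)
        accept = trial_cost <= cost or rng.random() < np.exp((cost - trial_cost) / temp)
        if accept:
            labels, cost = trial, trial_cost
            if cost < best_cost:
                best_labels, best_cost = labels.copy(), cost
    return best_labels, best_cost
```

The per-group term of the objective is exactly the distance parameter that later widens the matching threshold for that prototype, so a tighter clustering translates directly into more pruning at match time.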

Online, matching can be seen as traversing the tree structure of templates. 
Each node corresponds to matching a (prototype) template p with the image at 
some particular locations. For the locations where the distance measure between 
template and image is below a user-supplied threshold $\theta_p$, one computes new 
interest locations for the children nodes (generated by sampling the local 
neighborhood with a finer grid) and adds the children nodes to the list of nodes to be 
processed. For locations where the distance measure is above the threshold, 
search does not propagate to the sub-tree; it is this pruning capability that brings 
large efficiency gains. Initially, the matching process starts at the root and the 
interest locations lie on a uniform grid over relevant regions in the image. The 
tree can be traversed in breadth-first or depth-first fashion. In the experiments, 
we use depth-first traversal, which has the advantage that one needs to maintain 
only L - 1 sets of interest locations, with L the number of levels of the tree. 

Let p be the template corresponding to the node currently processed during 
traversal at level l and let $C = \{t_1, ..., t_c\}$ be the set of templates 
corresponding to its children nodes. Let $\delta_p$ be the maximum distance between p and the 
elements of C. 

$\delta_p = \max_{t_i \in C} D(p, t_i)$    (4)

Let $\sigma_l$ be the size of the underlying uniform grid at level l in grid units, and let $\mu$ 
denote the distance along the diagonal of a single unit grid element. Furthermore, 
let $\tau_{tol}$ denote the allowed shape dissimilarity value between template and image 
at a "correct" location. Then by having 

$\theta_p = \tau_{tol} + \delta_p + \frac{1}{2} \mu \sigma_l$    (5)

one has the desirable property that, using untruncated distance measures such 
as the chamfer distance, one can assure that the coarse-to-fine approach using 
the template hierarchy will not miss a solution. The thresholds one obtains by 
Equation 5 are very conservative; in practice one can use lower thresholds to 
speed up matching, at the cost of possibly missing a solution (see Experiments). 
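
The traversal and its thresholds can be sketched as a short recursive search. This is an illustration under assumptions, not the Chamfer System itself: templates are represented by string labels, the matching measure is passed in as a callable, the grid refinement simply samples a square neighbourhood, and the threshold is taken to be the tolerance plus the prototype-to-children distance plus half the diagonal of a grid cell.

```python
import math
from dataclasses import dataclass, field
from typing import Callable, List, Set, Tuple

Location = Tuple[int, int]

@dataclass
class Node:
    template: str                       # stand-in for a (prototype) template
    delta: float = 0.0                  # max distance to the children, Eq. (4)
    children: List["Node"] = field(default_factory=list)

def threshold(tau_tol: float, delta_p: float, sigma_l: int,
              mu: float = math.sqrt(2.0)) -> float:
    """Node threshold: tolerance + child spread + half a grid-cell diagonal."""
    return tau_tol + delta_p + 0.5 * mu * sigma_l

def refine(loc: Location, sigma: int, child_sigma: int) -> List[Location]:
    """Sample the neighbourhood of `loc` on the finer grid of the next level."""
    half = sigma // 2
    steps = range(-half, half + 1, child_sigma)
    return [(loc[0] + dr, loc[1] + dc) for dr in steps for dc in steps]

def search(node: Node, locations: List[Location], level: int,
           sigmas: List[int], tau_tol: float,
           distance: Callable[[str, Location], float],
           hits: Set[Tuple[str, Location]]) -> None:
    """Depth-first coarse-to-fine matching with pruning."""
    theta = threshold(tau_tol, node.delta, sigmas[level])
    for loc in locations:
        if distance(node.template, loc) >= theta:
            continue                    # prune: the sub-tree is never visited
        if not node.children:
            hits.add((node.template, loc))
            continue
        finer = refine(loc, sigmas[level], sigmas[level + 1])
        for child in node.children:
            search(child, finer, level + 1, sigmas, tau_tol, distance, hits)
```

Here `sigmas[level]` is the grid spacing used at each level of the tree, so a call starts with the root node and a coarse grid of locations. Lowering the thresholds below the conservative value trades the no-miss guarantee for speed, as the text notes.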

4 Verification: RBF-Based Pattern Classification 

As a result of the initial detection step, we obtain a (possibly empty) set of 
candidate solutions. The latter are described by a template id and the particular 
image location where the match was found. The verification step consists of 
revisiting the original image, extracting a rectangular window region corresponding 
to the bounding box of the template matched, normalizing the window for scale, 
and employing a local approximator based on Radial Basis Functions (RBFs) 
[?] to classify the resulting M x N pixel values. 

While training the RBF classifier, RBF centers are set in feature space by 
an agglomerative clustering procedure applied to the available training data. 
Linear ramps, rather than Gaussians, are used as radial functions, for efficiency 
purposes. Two radius parameters specify each such ramp, the radius where the 
ramp initiates (descending from the maximum probability value) and the radius 
where the ramp is cut off (after which the probability value is set to 0). These 
parameters are set based on the distance to the nearest reference vector of the same 
class and to that of the nearest reference vector of one of the other classes, in a 
manner described in [?]. The recall stage of the RBF classifier consists of 
summing probabilities that an unknown feature vector corresponds to a particular 
class, based on the contributions made by the various RBF centers. 
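
The ramp-shaped radial function and the recall stage can be sketched as follows. This is a minimal illustration: the maximum value of 1, the Euclidean distance in feature space and the unnormalised per-class sums are assumptions, and the way the two radii are derived from the training data is not reproduced here.

```python
import numpy as np

def ramp(dist, r_start, r_cut):
    """Linear ramp: 1 up to r_start, falling linearly to 0 at r_cut."""
    return np.clip((r_cut - dist) / (r_cut - r_start), 0.0, 1.0)

def rbf_recall(x, centers, center_class, r_start, r_cut, n_classes):
    """Sum the ramp activations of all RBF centers, separately per class.

    x            : (d,) feature vector to classify
    centers      : (m, d) RBF centers
    center_class : (m,) integer class label of each center
    r_start      : (m,) radius at which each ramp starts to descend
    r_cut        : (m,) radius beyond which each ramp is 0 (r_cut > r_start)
    """
    dist = np.linalg.norm(centers - x, axis=1)
    activation = ramp(dist, r_start, r_cut)
    return np.bincount(center_class, weights=activation, minlength=n_classes)
```

A ramp needs only a subtraction, a division and a clamp per center, which is the efficiency argument for preferring it over a Gaussian.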

One quickly realizes that the two classes involved (i.e. pedestrian and 
non-pedestrian) have quite different properties. The pedestrian class is comparatively 
well localized in feature space, while the non-pedestrian class is widely spread out. 
Our aim is to accurately model the target class, the pedestrians; mapping 
the vast region of non-pedestrians is both impractical and unnecessary. The only 
instances of the non-pedestrian class really needed are those which lie close to 
the imaginary border with the target class. In order to find these, an incremental 
bootstrapping procedure is used, similar to [?]. At each iteration, this procedure 
adapts the RBF classifier based on its performance on a new batch of non-target 
data. It only adds the non-target class examples which were classified incorrectly 
to the training set; then, it retrains the RBF classifier. 

We take incremental bootstrapping a step further and integrate the detection 
system into the loop, reflecting the actual system coupling between detection 
and verification. Each batch of new non-target data is thus prefiltered by the 
detection unit, which will introduce a useful additional bias towards samples 
close to the imaginary target vs. non-target border in feature space. 
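
The bootstrapping loop with the detector in the loop can be sketched generically. The callables `train`, `predict` and `detect` are hypothetical placeholders for the RBF training routine, the RBF decision and the Chamfer System respectively; none of their signatures come from the paper.

```python
def bootstrap(train, predict, detect, positives, initial_negatives, negative_batches):
    """Incremental bootstrapping with the detection stage as a prefilter.

    train(positives, negatives) -> model          (hypothetical signature)
    predict(model, sample) -> True if classified as target
    detect(sample) -> True if the detection stage would pass the sample on
    """
    negatives = list(initial_negatives)
    model = train(positives, negatives)
    for batch in negative_batches:
        # Only non-target samples that survive detection reach the classifier.
        candidates = [s for s in batch if detect(s)]
        # Keep those the current classifier wrongly accepts as targets.
        mistakes = [s for s in candidates if predict(model, s)]
        if mistakes:
            negatives.extend(mistakes)
            model = train(positives, negatives)    # retrain on the enlarged set
    return model, negatives
```

Only misclassified samples are added, so the negative set stays small and concentrated near the decision border, and the prefilter ensures those samples are of the kind the classifier will actually see in the coupled system.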

5 Experiments 

Experiments with pedestrian detection were performed off-line as well as 
on-board the Urban Traffic Assistant (UTA) demo vehicle. 

We compiled a database of about 1250 distinct pedestrian shapes at a given 
scale; this number doubled when mirroring the templates across the y-axis. On 
this set of templates, an initial four-level pedestrian hierarchy was built, 
following the method described in the previous Section. In order to obtain a more 
compact representation of the shape distribution and provide some means for 
generalization, the leaf level was discarded, resulting in the three-level hierarchy 
used for matching (e.g. Figure 2) with about 900 templates at the new leaf level, 
per scale. Five scales were used, with range 70-102 pixels. 

A number of implementation choices improved the performance and 
robustness of the Chamfer System, e.g. the use of oriented edge features, template 
subsampling, multi-stage edge segmentation thresholds and ground plane constraints. 
Applying SIMD processing (MMX) to the main bottlenecks of the system, 
distance transform computation and correlation, resulted in a speed-up of factor 
3-4. See [?]. 

Our preliminary experiments on a dataset of 900 images with no significant 
occlusion (distinct from the sequences used for training) showed detection rates 
in the 60-90 % range using the Chamfer System alone. With this setting, we 
obtained a handful of false detections per image, of which approximately 
90 % were rejected by the RBF classifier, at a cost of falsely rejecting 15 % of 
the pedestrians correctly detected by the Chamfer System. 

Figure [?] illustrates some candidate solutions generated by the Chamfer 
System. Figure [?] shows intermediate results; matches at various levels of the 
template hierarchy are illustrated in white, grey and black for the first, second 
and leaf level, respectively. We computed various statistics on our dataset, one 
of which is shown in Figure [?]. It shows the cumulative distribution of average 
chamfer distance values on the path from the root to the "correct" leaf template. 
The correct leaf template was chosen as the one among the training examples 
most similar to the shape labeled by the human for a particular image. It 
was Figure [?], rather than Equation 5, that was used to determine the distance 
thresholds at the nodes of the template hierarchy. For example, from Figure [?] 
it follows that by having distance thresholds of 5.5, 4.1 and 3.1 for nodes at the 
first, second and leaf level of the hierarchy, each level passes through about 80% 
of the correct solutions. Figure [?] provides in essence an indication of the quality 
of the hierarchical template representation (i.e. how well the templates at the 
leaf level represent the shape distribution and how good the clustering process is). 

In general, given image width W, image height H, and K templates, a 
brute-force matching algorithm would require W × H × K correlations between template 
and image. In the presented hierarchical approach both factors W × H and K 
are pruned (by a coarse-to-fine approach in image space and in template space). 
It is not possible to provide an analytical expression for the speed-up, because 
it depends on the actual image data and template distribution. Nevertheless, for 
this pedestrian application, we measured speed-ups of three orders of magnitude. 

The Urban Traffic Assistant (UTA) vehicle (Figure 7) is the DaimlerChrysler 
testbed for driver assistance in the urban environment [?]. It showcases the 
broader Intelligent Stop & Go function, i.e. the capability to visually "lock" onto 
a leading vehicle and autonomously follow it, while detecting relevant elements 
of the traffic infrastructure (e.g. lane boundaries, traffic signs, traffic lights). 
Detected objects are visualized in a 3-D graphical world in a way that mimics 
the configuration in the real world. See Figure [?]. The pedestrian module is a 
recent addition to UTA. It is being tested on traffic situations such as shown 
in Figure 8, where, suddenly, a pedestrian crosses the street. If the pedestrian 
module is used in isolation, the system runs at approximately 1 Hz on a 
dual-Pentium 450 MHz with MMX; 3-D information can be derived from the 
flat-world assumption. In the alternate mode of operation the stereo-module in UTA 
is used to provide a region of interest for the Chamfer System; this enables a 
processing speed of about 3 Hz. 

For updated results (including video clips) the reader is referred to the 
author's WWW site www.gavrila.net. 



6 Discussion 

Though we have been quite successful with the current prototype pedestrian 
system, evidently, we only stand at the beginning of solving the problem with 
the degree of reliability necessary to actually deploy such a system. A number 
of issues remain open in the current system. Starting with the Chamfer System, 
even though it uses a multi-stage edge segmentation technique, matching is still 
dependent on a reasonable contour segmentation. Furthermore, the proposed 
template-based technique will not be very suitable for detecting pedestrians very 
close to the camera. Currently, a multi-modal shape tracker is being developed 
(i.e. [?]) to integrate results over time and improve overall detection 
performance; single-image detection rates of 50% might not be problematic after all. 
Regarding the verification stage, the choice of an RBF classifier is probably not 
a determining factor; it would indeed be interesting to compare its performance 
with that of a Support Vector Machine [?]. 
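
As a rough worked calculation, the rates reported in the Experiments section can be combined into an end-to-end figure. This is only a back-of-the-envelope estimate: it assumes the 15 % false-rejection rate of the RBF classifier applies uniformly across the 60-90 % detection range of the Chamfer System, which the text does not state, and the symbols $r_{chamfer}$ and $n_{false}$ are introduced here for illustration.

```latex
% Assumption: RBF rejection rates are independent of the Chamfer System's rate.
\begin{align*}
\text{overall detection rate}
  &\approx r_{\text{chamfer}} \times (1 - 0.15),
  \qquad r_{\text{chamfer}} \in [0.60,\ 0.90] \\
  &\approx 0.51 \ \text{to} \ 0.765 \\
\text{false detections remaining per image}
  &\approx (1 - 0.90) \times n_{\text{false}} = 0.1\, n_{\text{false}}
\end{align*}
```

Under these assumptions the lower end of the combined rate is close to the 50 % single-image figure mentioned above, which is the regime where integrating results over time would have to make up the difference.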

The experiments indicated that detection performance varied considerably 
over parts of our database, according to the degree of contrast. Once the database 
is extended to include partially occluded pedestrians, or pedestrians at night, 
this variability is only going to increase, increasing the challenge of how to report 
the detection performance in a representative manner. Also, larger test sets will 
be needed; we will have an enlarged pedestrian database of 5000 images with 
ground truth (i.e. labeled pedestrian shapes) in the near future. 



7 Conclusions 

This paper presented a working prototype system for pedestrian detection 
on-board a moving vehicle. The system used a generic two-step approach for 
efficient object detection. The first step involved contour features and a hierarchical 
template matching approach to efficiently "lock" onto candidate solutions. The 
second step utilized the richer set of intensity features in a pattern classification 
approach to verify the candidate solutions (i.e. using Radial Basis Functions). We 
found that this combined approach was able to deliver quite promising results for 
the difficult problem of pedestrian detection. With further work (e.g. on 
temporal integration of results and integration with stereo/IR) we hope to come closer to 
the demanding performance rates that might be required for actual deployment 
of such a system. 

Fig. 7. The Urban Traffic Assistant (UTA) demonstration vehicle: (a) inside and (b) 
outside view 



Fig. 8. A potentially dangerous traffic situation: a pedestrian suddenly crossing the 
street 

Abstract. In this paper, we propose a method using stereo vision for 
visually guiding and controlling a robot in projective three-space. Our 
formulation is entirely projective. Metric models are not required and 
are replaced with projective models of both the stereo geometry and the 
robot's "projective kinematics". Such models are preferable since they 
can be identified from the vision data without any a-priori knowledge. 
More precisely, we present constraints on projective space that reflect the 
visibility and mobility underlying a given task. Using an interaction matrix 
that relates articulation space to projective space, we decompose the 
task into three elementary components: a translation and two rotations. 
This allows us to define trajectories that are both visually and globally 
feasible, i.e. problems like self-occlusion, local minima, and divergent 
control no longer exist. In this paper, we will not adopt a straightforward 
image-based trajectory tracking. Instead, we use a directly computed control 
that combines a feed-forward steering loop with a feed-back control loop, 
based on the Cartesian error of each of the task's components. 



1 Introduction 

The robot vision problem has driven much research in computer vision, but
although many approaches have been proposed, visual servoing has not yet made
the step from “the labs to the fabs” and scientific progress is still being made.
Changes in the way the system is modelled are currently stimulating such
progress.

Most position-based approaches are based on CAD models and precise
calibration of cameras and robots. Open-loop control is then sufficient for global
operation on tasks defined in workspace. In contrast, image-based approaches
are based on approximate local linear models of the robot-image interaction,
so closed-loop control allows local operations on tasks defined in image space.
These classical approaches are essentially based on, respectively, geometric and
differential metric models of the robot vision system.

Recent research in computer vision has made significant progress in modeling
multi-camera systems, thanks to the use of projective geometry. One of the
most interesting results is three-dimensional projective reconstruction based
on “uncalibrated” stereo vision: the stereo camera provides an instantaneous
representation of depth, 3D structure and 3D motion, but only up to a projective
transformation. As such representations are independent of metric geometry, a
prior metric calibration and CAD-models can sometimes be dispensed with.

Most research in this area focuses on additionally recovering the metric
calibration, although this has turned out to be difficult. Only a few researchers
have proposed robot vision systems that are based on non-metric models. In
particular, very little work tries to use projective stereo rigs directly, despite
their appeal as dynamic sensors for 3D structure and motion.

In this paper, we study such a projective robot vision system, which was
presented recently. Although its effectiveness in image-based visual servoing has
already been demonstrated, here we exploit the 3D capabilities of stereo and
formulate a directly computed control in projective space. This allows us to
overcome the most important problems of the image-based approach, namely:
self-occlusion, local minima, and lack of global convergence.

Overview of the Paper and on the State-of-the-Art 

In section 2 we sketch the background of our approach; full detail is given in
the cited work. In section 3 we define mobility constraints in projective space,
including several 1-dof motions - “visual” and hence “virtual” mechanisms - which
are later used to formulate parameterized trajectory functions. Previous work
considers only a single camera and thus has to use Reuleaux surfaces
as constraints, i.e. a cylinder for a revolution or a prism for a translation. In
contrast, considering stereo and projective three-space allows virtual mechanisms
to be defined from a minimal number of constraints on very simple primitives,
e.g. two 3D-points for a revolution to be defined.

In section 4 we calculate such constraints for a 6-dof reaching-task and
decompose it into three visual mechanisms which respect the mobility and visibility
of the task. This construction relies on local information on the interaction
between joint-space and projective space but not on position-based information.
Parametric trajectory functions are then defined to describe the desired
reaching motion. In previous work on visual servoing, trajectory generation is
often explicit and related to camera-space, whereas task-space would be more
appropriate. Furthermore, it relies heavily on metric knowledge. In subsection
4.2 we describe how visibility of the faces of a tool-object implies constraints on
the trajectory parameters. Most previous work on visibility uses local reactive
methods in order to avoid image borders or obstacles. In contrast, we consider
the often neglected but important problem of object self-occlusion and obtain
occlusion-free trajectories in closed-form.

In section 5 we describe a directly computed control, consisting of a
feed-forward part, which guides the motion along a globally valid and visually feasible
trajectory, and a feed-back part, which drives a Cartesian configuration-error to
zero. Recent research shows a tendency towards integrating 3D- or pose
information into the initially image-based approaches. The aim is a control-error
that no longer reflects a linear image motion, but a 3D rigid motion. In
current approaches, the calculation of such Cartesian errors is independent of the
control calculation and does not use the interaction matrix. Hence, they rely
on a metric calibration which, if it is a coarse one, will hardly affect stability,
but will affect trajectories and will degrade performance. In our approach, the
3-dof Cartesian error and the direct control are the result of one and the same
calculation based on the interaction matrix and on the trajectory constraints.
Moreover, our projective formalism is independent of metric system parameters
and works with the most general model, i.e. with an interaction matrix relating
robot joint-space to projective space.

Finally, in section 6 we present experiments using simulations based on real
data. We demonstrate the efficiency of our method in a classical benchmark
and evaluate the performance.

Notation. Bold type $\mathbf{H}$, $\mathbf{T}$ is used for matrices,
bold italic $\boldsymbol{A}$, $\boldsymbol{a}$ for vectors, and Roman $a$, $\nu$, $\theta$, $\lambda$ for real numbers, scale factors,
angles, coefficients, etc. Column vectors are written as $A$, $a$, and row vectors as
the transpose $a^\top$, $h^\top$, where uppercase letters stand for spatial points, and lowercase
$a$, $m$ for planes or image points.



2 Preliminaries 



Stereo Vision in Projective Space. Consider two pinhole cameras that have
constant intrinsic parameters and that are rigidly mounted onto a stereo rig.
Their epipolar geometry is constant and allows a pair of 3 x 4 projection matrices
$P$, $P'$ to be defined. Then, the left and right images $m, m' \in \mathbb{P}^2$ of a 3D
Euclidean point $N$ have a reconstruction $M \in \mathbb{P}^3$ in projective space which is
related to the Euclidean one by the 3D homography $H_{PE}$ and an unknown scalar
$\rho$ in each point:



$$\begin{bmatrix} \zeta\, m \\ \zeta' m' \end{bmatrix} = \begin{bmatrix} P \\ P' \end{bmatrix}_{6\times 4} M, \qquad \rho\, N = H_{PE}\, M, \qquad H_{PE} = \begin{bmatrix} K^{-1} & 0 \\ a^\top & 1 \end{bmatrix}_{4\times 4} \tag{1}$$



The 4 x 4 matrix $H_{PE}$ is constant and contains the unknown calibration: the
affine one in the infinity-plane $(a^\top\ 1)$, and the metric one in the (left) intrinsic
parameters $K$, which is upper-triangular. Implicitly, this defines a projective frame in
which the reconstruction is done and which can be imagined as five points rigidly
linked with the stereo head.

An object undergoing the displacement $T_{RT}$ in Euclidean space appears in
projective space to undergo the conjugate projective motion $H_{RT}$, a 4 x 4
homography well-defined from at least five object-points $M' = H_{RT} M$. We
will always normalize them to $\det(H_{RT}) = 1$, as $\det(T_{RT}) = 1$, and call them
“projective displacements”:

$$H_{RT} = \gamma\, H_{PE}^{-1}\, T_{RT}\, H_{PE}, \qquad \det H_{RT} = 1, \text{ i.e. } \gamma = 1. \tag{2}$$



This conjugacy to the Lie group $SE(3)$ allows a corresponding conjugate Lie
algebra to be defined, whose elements are denoted by $\hat{H}_{RT}$, while $\hat{T}_{RT}$ denote
those of the Lie algebra $se(3)$.

$$\hat{H}_{RT} = H_{PE}^{-1}\, \hat{T}_{RT}\, H_{PE}, \qquad \exp(\hat{H}_{RT}) = H_{RT}. \tag{3}$$

For a homography $H$ acting on points, its dual, now acting on planes, is $H^{-\top}$,
where plane-vectors are written as columns $a$. An element $\hat{H}_{RT}$ of the Lie
algebra is a tangent operator acting on points, which has a corresponding tangent
operator $-\hat{H}_{RT}^\top$ acting on plane-vectors $a$.

Since the action of the projective displacement group preserves the scale
$\rho$ hidden in the projective coordinates, the orbits of vectors in $\mathbb{R}^4$ are in fact
hyperplanes of $\mathbb{R}^4$, characterized by $\rho = (a^\top\ 1)\, M$. Thanks to this, a projective
motion $H_{RT}(t) M$ of a point has the velocity $\dot{M}(t)$, and dually, a plane $H_{RT}^{-\top}(t)\, a$
has a velocity $\dot{a}(t)$, both well-defined up to an individual scalar $\rho$. Analogous
to $se(3)$, these velocities can be calculated using the projective operator $\hat{H}_{RT}$
tangent to $H_{RT}(t)$ at $t$, or its dual.

$$M(t) = H_{RT}(t)\, M, \qquad \dot{M} = \hat{H}_{RT}\, M, \tag{4}$$

$$a(t) = H_{RT}^{-\top}(t)\, a, \qquad \dot{a} = -\hat{H}_{RT}^\top\, a. \tag{5}$$



Below, the relationships between points, lines, planes and their duals are briefly
stated. For a point or a line through two points $A_i$, their dual is determined by
the null-space or kernel (ker) defined in (6). Geometrically, they are respectively
three or two planes $a_j^\top$ with the point or the line being their intersection.

Robot Kinematics in Projective Space. Consider an uncalibrated
stereo rig observing a robot manipulator and capturing the end-effector’s motion
by continuously reconstructing some marked points on it (Fig. 1). The projective
motion $H_6(q)$ as a function of the vector $q$ of joint variables is a product of the
projective motions of each of the joints. These are either projective rotations
$H_R(\theta)$ of a revolute joint, or projective translations $H_T(\tau)$ of a prismatic joint.
Both are generically denoted as $H_j(q_j)$ for joint $j$. Mathematically, they
are conjugate representations of the classical one-parameter Lie groups $SO(2)$
and $\mathbb{R}$. They have Lie algebras conjugate to the classical Lie algebras $so(2)$ and
$\mathbb{R}$, which have respective representations as 4 x 4 matrices, $\hat{H}_R$ and $\hat{H}_T$.
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( 7 ) 



The similarity $H_j$ is different for each joint and contains the joint’s position
and orientation, as well as a part of the calibration matrix $H_{PE}$. These two
contributions are difficult to separate in general.

Since the conjugacy preserves the underlying algebraic structure, the
projective representations can be manipulated without resolving the similarity, i.e.
without calibration. Therefore, respective formulae for going from “algebra-to-group”
(8) and from “group-to-algebra” (9) can be shown to have closed forms
analogous to the Euclidean ones:

$$H_R(\theta) = I + \sin\theta\, \hat{H}_R + (1 - \cos\theta)\, \hat{H}_R^2, \qquad H_T(\tau) = I + \tau\, \hat{H}_T, \tag{8}$$

(9) 
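
The closed form (8) can be checked numerically. The sketch below is only an illustration under stated assumptions: the matrix `H_PE` is a made-up, mildly conditioned stand-in for the unknown calibration, the helper names are hypothetical, and `rotation_operator_from_motion` is one possible group-to-algebra step that follows from (8), not necessarily the formula the authors give in (9).

```python
import numpy as np

def expm_series(A, terms=40):
    """Matrix exponential by a truncated Taylor series (fine for small 4x4 operators)."""
    result, term = np.eye(len(A)), np.eye(len(A))
    for k in range(1, terms):
        term = term @ A / k
        result = result + term
    return result

def rotation_twist(axis, point):
    """Euclidean se(3) element of a unit-rate rotation about the line (point, axis)."""
    w = np.asarray(axis, float) / np.linalg.norm(axis)
    W = np.array([[0, -w[2], w[1]], [w[2], 0, -w[0]], [-w[1], w[0], 0]])
    T = np.zeros((4, 4))
    T[:3, :3] = W
    T[:3, 3] = -W @ np.asarray(point, float)
    return T

def translation_twist(direction):
    """Euclidean se(3) element of a translation along `direction`."""
    T = np.zeros((4, 4))
    T[:3, 3] = np.asarray(direction, float)
    return T

# Hypothetical stand-in for the unknown calibration H_PE (an assumption).
H_PE = np.array([[1.0, 0.1, 0.0, 0.2],
                 [0.0, 1.0, 0.1, 0.0],
                 [0.0, 0.0, 1.0, 0.1],
                 [0.05, 0.02, 0.01, 1.0]])

def conjugate(T_hat):
    """Projective operator H_hat = H_PE^-1 T_hat H_PE, as in (3)."""
    return np.linalg.inv(H_PE) @ T_hat @ H_PE

def projective_rotation(H_hat, theta):
    """Closed form (8): I + sin(theta) H_hat + (1 - cos(theta)) H_hat^2."""
    return np.eye(4) + np.sin(theta) * H_hat + (1 - np.cos(theta)) * H_hat @ H_hat

def projective_translation(H_hat, tau):
    """Closed form (8): I + tau H_hat (the operator is nilpotent)."""
    return np.eye(4) + tau * H_hat

def rotation_operator_from_motion(H_R, theta):
    """Uses H_R(theta) - H_R(-theta) = 2 sin(theta) H_hat, with H_R(-theta) = H_R^-1."""
    return (H_R - np.linalg.inv(H_R)) / (2 * np.sin(theta))

H_hat_R = conjugate(rotation_twist([0, 0, 1], [0.3, -0.2, 0.5]))
H_hat_T = conjugate(translation_twist([1.0, 0.5, -0.2]))
```

The conjugation leaves the identities $\hat{H}_R^3 = -\hat{H}_R$ and $\hat{H}_T^2 = 0$ intact, which is why the Euclidean closed forms carry over without knowing `H_PE`.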

In practice, the benefits of these equations are as follows. On the one hand,
an observed trial motion ($H_j(q_j) \in$ group) of a single joint $j$ allows the
corresponding operator ($\hat{H}_j \in$ algebra) to be recovered, representing projectively its
kinematics. On the other hand, for given joint values $q_j$ ($\theta_j$ or $\tau_j$), the six joint
operators $\hat{H}_j$ constitute a projective kinematic model, and the forward
kinematics $H_q = H_6(q)$ can be calculated projectively as the product-of-exponentials
(10), where each exponential has one of the above analytic forms:

$$H_6(q) = \exp(q_1 \hat{H}_1) \cdots \exp(q_6 \hat{H}_6). \tag{10}$$

The robotics literature calls such a model “zero-reference” as it refers to the origin
$q = 0$ of joint-space. For $q(t)$ being a joint-space motion starting at zero, the
partial derivatives of $H_q$ in $q$ allow the end-effector’s velocity $\dot{H}_q(t)$ to be
written linearly as a sum of the joint operators (11). Consequently, this
expression for the interaction between joint- and projective motion equally allows for
the Jacobians $J_H$ of projective point- or plane-velocities, $\dot{M}$ or $\dot{a}$, to be written
in matrix form (12).

$$\dot{H}_q = \dot{q}_1 \hat{H}_1 + \cdots + \dot{q}_6 \hat{H}_6, \quad \text{where } \hat{H}_j = \partial H_q / \partial q_j \,|_{q=0} \tag{11}$$

$$\dot{M} = [\hat{H}_1 M, \ldots, \hat{H}_6 M]_{4\times 6}\, \dot{q}, \qquad \dot{a} = [-\hat{H}_1^\top a, \ldots, -\hat{H}_6^\top a]\, \dot{q} \tag{12}$$
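
As an illustration of (10) to (12), the following sketch builds a product-of-exponentials model from six made-up joint operators and compares the point Jacobian at $q = 0$ with finite differences. The operators, the matrix `H_PE`, and all function names are assumptions chosen for the example; they do not come from the paper's robot.

```python
import numpy as np

def expm_series(A, terms=40):
    """Matrix exponential by a truncated Taylor series (fine for small 4x4 operators)."""
    result, term = np.eye(len(A)), np.eye(len(A))
    for k in range(1, terms):
        term = term @ A / k
        result = result + term
    return result

# Hypothetical stand-in for the unknown calibration H_PE (an assumption).
H_PE = np.array([[1.0, 0.1, 0.0, 0.2],
                 [0.0, 1.0, 0.1, 0.0],
                 [0.0, 0.0, 1.0, 0.1],
                 [0.05, 0.02, 0.01, 1.0]])

def random_operator(rng):
    """Made-up joint operator: a random Euclidean twist conjugated by H_PE."""
    w = rng.normal(size=3)
    w /= np.linalg.norm(w)
    T = np.zeros((4, 4))
    T[:3, :3] = np.array([[0, -w[2], w[1]], [w[2], 0, -w[0]], [-w[1], w[0], 0]])
    T[:3, 3] = rng.normal(size=3)
    return np.linalg.inv(H_PE) @ T @ H_PE

def forward_kinematics(operators, q):
    """Product of exponentials (10): exp(q_1 H_1) ... exp(q_6 H_6)."""
    H = np.eye(4)
    for H_hat, q_j in zip(operators, q):
        H = H @ expm_series(q_j * H_hat)
    return H

def point_jacobian_at_zero(operators, M):
    """4x6 point Jacobian (12) at q = 0: column j is H_hat_j M."""
    return np.column_stack([H_hat @ M for H_hat in operators])

rng = np.random.default_rng(0)
operators = [random_operator(rng) for _ in range(6)]
M = np.array([0.2, -0.1, 0.4, 1.0])      # a projective point (arbitrary example)
J = point_jacobian_at_zero(operators, M)
```

Each column of `J` is the velocity of `M` caused by a unit velocity of one joint, which is exactly what a finite-difference perturbation of a single joint value measures.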

An image-based visual servoing could thus be formulated in terms of
image-velocities $\dot{s}$ and the Jacobian $J_G(m)$ of the perspective projection map
$s = G(m) = (m_1/m_3,\ m_2/m_3)^\top$ (13). In contrast, we will remain in projective
three-space and use $\dot{M}$ instead of $\dot{s}$, and $J_H$ instead of $J_G$.
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3 Projective Mechanisms 

In this section, we express primitive motions in terms of “virtual mechanisms”,
constraints on the mobility of points and planes in projective three-space.
Solving these constraints for a joint-space motion and the resulting projective
motion amounts to a local “decoupling” of the general projective kinematic model
into such projective mechanisms. Formulating the problem in the visual
domain allows these constraints to reflect the geometry underlying the current
task (section 4.1), to express visibility conditions (section 4.2) and feasible
trajectories (section 4.3), and to directly compute the joint-velocities actually
driving the visual mechanisms (section 5).

Translation along an Axis. Suppose a direction of translation is given in terms
of an axis through two points $A_1$, $A_2$. Their dual is a pencil of planes spanned
by any two planes $a_1^\top$, $a_2^\top$ that intersect in the above axis (6). A rigid motion
for which the velocities of both planes vanish is a pure translation along the given
axis, and this is the only such rigid motion. Since the projective kinematic model -
here the plane-operators $-\hat{H}_j^\top$ for each joint $j$ - allows all rigid motions and
respective plane-velocities to be characterized, one can look for the only joint-space
motion $q_t$ for which these plane-velocities vanish. This is formalized by requiring
$\dot{q} = q_t$ to be in the kernel in (14), with $\nu$ an arbitrary scalar. The
corresponding one-dimensional group of projective translations is then described
by its operator $\hat{H}_T$, which can be obtained as a linear combination based on
$q_t$.



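$$\nu\, q_t = \ker \begin{bmatrix} -\hat{H}_1^\top a_1, & \ldots, & -\hat{H}_6^\top a_1 \\ -\hat{H}_1^\top a_2, & \ldots, & -\hat{H}_6^\top a_2 \end{bmatrix}, \qquad \hat{H}_T = q_{t1} \hat{H}_1 + \cdots + q_{t6} \hat{H}_6. \tag{14}$$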



Revolutions around an Axis. Consider two points $A_1$, $A_2$ on an axis and the
point-operators $\hat{H}_j$ for each joint $j$. Among all rigid motions, here expressed as
joint-space motions, the one $q_r$ for which the velocities of both points, i.e. the
sums of $\hat{H}_j A_1$ and $\hat{H}_j A_2$, vanish, results in a revolution $H_R$ around the axis
connecting the points. Thus, it can be written as the kernel in (15) with free
scalar $\nu$. The operator $\hat{H}_R$ corresponding to $q_r$ generates the corresponding
one-dimensional projective rotation group, the eigenvalues of which allow
the scale of $\nu$ to be normalized to radians.

$$\nu\, q_r = \ker \begin{bmatrix} \hat{H}_1 A_1, & \ldots, & \hat{H}_6 A_1 \\ \hat{H}_1 A_2, & \ldots, & \hat{H}_6 A_2 \end{bmatrix}, \qquad \hat{H}_R = (q_{r1} \hat{H}_1 + \cdots + q_{r6} \hat{H}_6), \tag{15}$$

where $\nu$ is chosen such that $\hat{H}_R$ has eigenvalues $i$, $-i$.
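
A minimal numerical sketch of the kernel construction (15) follows. It assumes a toy, fully actuated "robot" whose six zero-reference operators are the conjugated basis twists of $se(3)$, a made-up `H_PE`, and the relation $\rho N = H_{PE} M$ for converting Euclidean test points; the null space is taken from an SVD, which is one common choice and not necessarily what the authors implemented.

```python
import numpy as np

# Hypothetical stand-in for the unknown calibration H_PE (an assumption).
H_PE = np.array([[1.0, 0.1, 0.0, 0.2],
                 [0.0, 1.0, 0.1, 0.0],
                 [0.0, 0.0, 1.0, 0.1],
                 [0.05, 0.02, 0.01, 1.0]])
H_PE_inv = np.linalg.inv(H_PE)

def basis_twist(k):
    """k = 0..2: rotation about a coordinate axis; k = 3..5: translation along one."""
    T = np.zeros((4, 4))
    e = np.eye(3)[k % 3]
    if k < 3:
        T[:3, :3] = np.array([[0, -e[2], e[1]], [e[2], 0, -e[0]], [-e[1], e[0], 0]])
    else:
        T[:3, 3] = e
    return T

# Toy joint operators spanning the conjugated se(3) (an assumption, not a real arm).
operators = [H_PE_inv @ basis_twist(k) @ H_PE for k in range(6)]

def to_projective(X):
    """Projective coordinates of a Euclidean point, using rho N = H_PE M."""
    return H_PE_inv @ np.append(X, 1.0)

def revolution_about_axis(operators, A1, A2):
    """Kernel (15): joint motion q_r for which the velocities of A1 and A2 vanish."""
    J = np.vstack([np.column_stack([H @ A for H in operators]) for A in (A1, A2)])
    _, s, Vt = np.linalg.svd(J)          # J is 8x6; the last row of Vt spans the kernel
    q_r = Vt[-1]
    H_R = sum(q * H for q, H in zip(q_r, operators))
    scale = np.max(np.abs(np.linalg.eigvals(H_R).imag))   # normalise to eigenvalues +-i
    return q_r / scale, H_R / scale, s

A1 = to_projective(np.array([0.1, 0.2, 0.3]))
A2 = to_projective(np.array([0.4, 0.0, 0.5]))
q_r, H_R, singular_values = revolution_about_axis(operators, A1, A2)
```

A single vanishing singular value indicates the regular, fully actuated case; several vanishing values would correspond to the singular case discussed in the text, where the kernel has a higher dimension.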



Revolution around a Point in a Plane. Another way to visually constrain
a revolution is as follows: Suppose the action of the revolution on a given plane
$a_1^\top$ to be a “planar” rotation, i.e. the plane turns “in-place”, and suppose
additionally one point $A_1$ on the axis. Among all joint-space motions, i.e. among
all rigid motions, the one $q_p$ for which both velocities, one resulting from the
point-velocities $\hat{H}_j A_1$ and one resulting from the plane-velocities $-\hat{H}_j^\top a_1$, vanish, is
the above described “planar” revolution $H_P$. The axis of this revolution passes
through $A_1$ and is perpendicular to $a_1^\top$. It can be written as the kernel in (16),
and is recombined and normalized to the point-operator $\hat{H}_P$ of the corresponding
one-dimensional projective rotation group:

$$\nu\, q_p = \ker \begin{bmatrix} \hat{H}_1 A_1, & \ldots, & \hat{H}_6 A_1 \\ -\hat{H}_1^\top a_1, & \ldots, & -\hat{H}_6^\top a_1 \end{bmatrix}, \qquad \hat{H}_P = (q_{p1} \hat{H}_1 + \cdots + q_{p6} \hat{H}_6), \tag{16}$$

where $\nu$ is chosen such that $\hat{H}_P$ has eigenvalues $i$, $-i$.
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Actually, the above postulated equivalence between joint-space motions and 
rigid (projective) motions holds only in case of the robot being fully actuated. In 
case of under-actuation, the kernel becomes empty if the projective mechanism
corresponds to the missing degree-of-freedom. In case of a singularity, the kernel
is of a higher dimension. It comprises a fixing motion which yields a zero
movement for all points and planes, and possibly the projective mechanism itself,
which can be detected easily. So, either there exists currently no joint-space
motion corresponding to the projective mechanism or there exists a family of such.
This direct relationship between robot singularities and mobility, as defined in 
the visual (projective) domain, is highly useful for singularity avoidance in visual 
servoing. 




4 Trajectories 

In this section, the idea is to rewrite a given alignment task in terms of three
primitive motions and to extend this to a reaching motion guided by trajectories
that are visually and globally feasible. The task is partitioned in section 4.1 into
a translation of a central point, followed by a hinge-like rotation of the face
onto the target face, and finally a rotation of the markers within the face plane
onto their target positions. Although the partitioning results in the
primitive motions being “in-sequence”, the way they are constructed ensures
that subsequent motions do not disturb the results of the preceding ones. For
instance, both rotations are about the center point, thus it remains unaffected,
and the final rotation moves points only within a plane, but the face as a plane
remains unaffected. Consequently, the three motions can be driven independently
without the general characteristics of the trajectories (section 4.3) being affected.

Fig. 2. Translation to $\tau_0$ where visibility changes.

Fig. 3. Rotation of the face in the sense which preserves visibility.

These “Cartesian motions”, as they are called in robotics, are a superposition
of a straight-line trajectory of a central point with a rotation about this point.
A feed-forward control of such trajectories, as done in section 5, assures global
validity of the visual control and allows for permanent visibility of the face, as
developed in section 4.2.

4.1 Partitioning

In this section, we describe the computational geometry used to translate the
geometry underlying the task into constraints on projective mechanisms (section
3), where a task is given by the current and the target position of the markers,
$A_i$ and $A_i^*$, and the current and target position of the face plane, $a^\top$ and $a^{*\top}$.
The result will be three projective mechanisms, i.e. their operators $\hat{H}_T$, $\hat{H}_R$, $\hat{H}_P$
and their joint-space equivalents $q_t$, $q_r$, $q_p$, corresponding to the three primitive
motions of the task.

The first component (translation of center) is to choose one marker
or the markers’ midpoint as a center point $A_c$ and to partition the task into a
translation (see figure, left), modulo a rotation of the face around the center. The
respective operator $\hat{H}_T$ is obtained by applying (14) to the center’s current and
target position, $A_c$ and $A_c^*$. A “distance-to-target” is the amplitude $\tau$ of this
translation. It is obtained by solving (17) for $\tau$, which represents the intersection
of $A_c$’s straight-line path with a transversal plane through $A_c^*$ (see figure, left).

(17)
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Then, the translation can be “removed” from the task by applying the projective 
translation $H_T(\tau) = I + \tau \hat{H}_T$ “backwards” onto the target primitives, which after
that are subscripted with $t$:

$$A_{it} = H_T(-\tau)\, A_i^*, \qquad a_t = H_T^{-\top}(-\tau)\, a^* \tag{18}$$

The removal in the “backwards” sense results in the residual task being expressed
by new target primitives, $A_{it}$ and $a_t$, and in the subsequent rotations being
expressed for the current position of the primitives. This also implies them being
expressed for the current robot configuration, which is crucial for the direct
control in section 5 to be valid.

The second component (hinge of two planes) is a rotation around the
axis of two planes: the initial and the translated target face, $a^\top$ and $a_t^\top$. In
this way, the rotational part of the task is split into two, a rotation $H_R(\theta_r)$
onto the target plane (see figure, center), modulo a residual rotation $H_P$ within this
plane (see figure, right). The respective operator $\hat{H}_R$ is obtained by applying (15)
to two points on the axis. An “angle-to-target” is determined by solving (19)
for $\theta_r$, which represents the intersection of the new target plane $a_t^\top$ with the
circular path of a point $A_i$ on the face. The resulting first-order trigonometric
equation in $\theta_r$ with coefficients $p_0$, $p_s$, $p_c$ (20) has an analytic solution $\theta_r^-$, $\theta_r^+$
which is found¹ after half-angle substitution as the arctan in the roots $\alpha$, $\beta$ of
two quadratic equations (21).

$$a_t^\top \left( I + \sin\theta_r\, \hat{H}_R + (1 - \cos\theta_r)\, \hat{H}_R^2 \right) A_i = 0, \tag{19}$$

$$p_0 + p_s \sin\theta_r + p_c \cos\theta_r = 0, \tag{20}$$

$$\theta_r^+ = \arctan(\alpha(p_0, p_s, p_c),\ \beta(p_0, p_s, p_c)), \qquad \theta_r^- = \pi - \theta_r^+ \tag{21}$$
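
For readers who want to experiment with equations of the form (20), the sketch below solves $p_0 + p_s \sin\theta + p_c \cos\theta = 0$ numerically. It is an assumption-laden illustration: it uses the amplitude-phase form $p_s \sin\theta + p_c \cos\theta = r \sin(\theta + \varphi)$ rather than the half-angle substitution mentioned in the text, and the function name is hypothetical.

```python
import math

def solve_first_order_trig(p0, ps, pc):
    """Return the solutions theta in (-pi, pi] of p0 + ps*sin(theta) + pc*cos(theta) = 0.

    Uses ps*sin(t) + pc*cos(t) = r*sin(t + phi) with r = hypot(ps, pc) and
    phi = atan2(pc, ps). An empty list means that no real solution exists.
    """
    r = math.hypot(ps, pc)
    if r == 0.0 or abs(p0) > r:
        return []
    phi = math.atan2(pc, ps)
    base = math.asin(-p0 / r)

    def wrap(angle):
        # Map an angle to the principal interval (-pi, pi].
        return math.atan2(math.sin(angle), math.cos(angle))

    return [wrap(base - phi), wrap(math.pi - base - phi)]

# Example coefficients (arbitrary values chosen for illustration).
solutions = solve_first_order_trig(0.3, 1.0, -0.5)
```

The two returned angles play the role of the pair $\theta_r^-$, $\theta_r^+$; which of them is the "right" one still has to be decided by the side-of-plane arguments of section 4.2.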

Again, the rotation is removed by applying the projective rotation $H_R(-\theta_r)$
backwards, now to the $t$-primitives, resulting in new target primitives then
subscripted with $r$:

$$A_{ir} = H_R(-\theta_r)\, A_{it}, \qquad a_r = H_R^{-\top}(-\theta_r)\, a_t \tag{22}$$

The third component (rotation within plane) is a planar revolution $H_P(\theta_p)$
of the face around $A_c$ to finally move the markers onto their target positions
$A_{ir}$. The respective operator $\hat{H}_P$ is obtained by applying (16) to $a^\top$ and $A_c$. The
angle $\theta_p$ requires intersecting the circular path of a point $A_i$ with a transversal
plane through the corresponding target point $A_{ir}$ (see figure, right), in analogy
to (19).

4.2 Visibility 

In general, naive projective coordinates are “unoriented”. So, the front- and
back-side of the face are indistinguishable, and visibility is undecidable. However,
projective displacements preserve the scalar $\rho$ of an orbit (section 2), and so
the sign of projective coordinates of both points and planes. Therefore, it is
decidable in this case from the sign of their product whether a point changes side
with respect to a plane. For the face plane $a^\top$ and the optical center $O$, this
amounts to detecting changes in the face’s visibility from the sign of $a^\top O$.
For the above introduced one-parameter motions, such an event is precisely
characterized by an amplitude $\tau_0$ (Fig. 2), or a pair of angles $\theta_0^-$, $\theta_0^+$ (Fig. 3), as
obtained from (17) or (19) applied to the optical center.

¹ For the sake of brevity, full technical detail had to be omitted.

In terms of trajectories and their components this means the following. If
the translation is towards and beyond $\tau_0$, a respective reorientation of the face
is required before the feed-forward reaches $\tau_0$ (Fig. 2). If the rotation is either
leaving or entering the interval $[\theta_0^-, \theta_0^+]$, the visibility changes and a respective
translation of the face is required before the feed-forward reaches a $\theta_0$ (Fig. 3).
The above concerns only the rotation $H_R$, since visibility remains unaltered under
the planar rotation $H_P$. Additionally, such “side-of-plane” arguments are heavily
used in the implementation in order to determine the
“right” sense of $\theta_r$ (21). This is required to avoid the back-face being turned
towards the camera or the face being moved backside-up onto the target.
Please note also that in the presence of the second camera, the above
arguments apply independently to both of them, such that the most conservative
thresholds $\tau_0$, $\theta_0$ have to be taken.



4.3 Generation 

Now, we formalize in (23) a family of Cartesian trajectories $H_d(\sigma)$ that allow
the three independent parts of the task to be executed simultaneously. Three
functions $\mu_t(\sigma)$, $\mu_r(\sigma)$, $\mu_p(\sigma)$ in a common abscissa $\sigma$ (24) allow the
characteristics of the trajectories to be modified and the visibility constraints to
be incorporated. In analogy to the product-of-exponentials (10), the projective
mechanisms have to be multiplied in reverse order (23) for the desired trajectories
to emerge. Intuitively, the translation is the left-most one, since the rotations must not
affect its direction nor the position of the center. The hinge is the second one,
since the planar rotation must not affect the face as a plane:

$$H_d(\tau(\sigma), \theta_r(\sigma), \theta_p(\sigma)) = \exp(\tau(\sigma) \hat{H}_T)\, \exp(\theta_r(\sigma) \hat{H}_R)\, \exp(\theta_p(\sigma) \hat{H}_P), \tag{23}$$

$$\tau(\sigma) = \mu_t(\sigma)\, \tau, \qquad \theta_r(\sigma) = \mu_r(\sigma)\, \theta_r, \qquad \theta_p(\sigma) = \mu_p(\sigma)\, \theta_p. \tag{24}$$

Here, the $\mu$ are monotonically growing functions of the abscissa $\sigma$ with values in
$[0, 1]$, subject to visibility constraints between $\mu_r$ and $\mu_t$. More formally, if the
face rotated by $\theta_r(\sigma)$ is visible then $\mu_t(\sigma) < \tau_0$, and $\mu_t(\sigma) > \tau_0$ otherwise.
Vice versa, if the target face translated by $\tau(\sigma)$ is visible then $\mu_r(\sigma) < \theta_0$,
and $\mu_r(\sigma) > \theta_0$ otherwise. Note that $\mu_p(\sigma)$ is always unconstrained.
Either of these cases can be used to drive a feed-forward in either $\tau$
or $\theta_r$ while constraining the other correspondingly. Additionally, the overall
behaviour can be modified. For instance, a linear decay of time-to-goal arises for
$\mu(\sigma) = \sigma$, whereas an exponential decay as in classical feed-back loops arises
for $\mu(\sigma) = 1 - \exp(-\sigma)$. An initial very flat plateau in the respective $\mu$ allows
trajectories like rotation-first, translation-first, planar-first to be implemented.
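
The role of the abscissa functions in (24) can be illustrated with a few lines of code. This is a sketch only: the three profile functions, the `rate` and `delay` parameters, and the normalisation of $\sigma$ to $[0, 1]$ for the linear and delayed profiles are assumptions made for the example, not values taken from the paper.

```python
import math

def mu_linear(sigma):
    """Linear profile, clamped to [0, 1]."""
    return min(max(sigma, 0.0), 1.0)

def mu_exponential(sigma, rate=5.0):
    """Exponential profile 1 - exp(-rate * sigma); `rate` is an assumed tuning knob."""
    return 1.0 - math.exp(-rate * max(sigma, 0.0))

def mu_delayed(sigma, delay=0.5):
    """Flat plateau up to `delay`, then linear growth reaching 1 at sigma = 1."""
    return min(max((sigma - delay) / (1.0 - delay), 0.0), 1.0)

def trajectory_parameters(sigma, tau, theta_r, theta_p,
                          mu_t=mu_linear, mu_r=mu_linear, mu_p=mu_linear):
    """Evaluate (24): scale the three amplitudes by their abscissa functions."""
    return mu_t(sigma) * tau, mu_r(sigma) * theta_r, mu_p(sigma) * theta_p

# Translation-first example: the hinge rotation only starts after sigma = 0.5.
params = trajectory_parameters(0.25, 2.0, 1.0, 0.5, mu_r=mu_delayed)
```

Choosing a delayed profile for one component and a steep profile for another is how orderings such as translation-first or rotation-first can be expressed with the same three amplitudes.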

Although robot guidance based on trajectory tracking and a feed-back law is 
now perfectly feasible, we will further exploit the above established direct relati- 
ons between joint-space motions, projective mechanisms, Cartesian trajectories, 
and velocities in projective-space in order to come up in the next section with a 
directly computed control. 



5 Control 

In this section, we devise a twofold scheme for a directly calculated visual servo 
control. On the one hand, there is a feed-forward steering loop which drives the 
robot along trajectories restricted by visibility and other mobility constraints. 
On the other hand, there is a feed-back servoing loop which drives a 3-dof or 
2-dof Cartesian control-error down to zero. The video feed-back from the stereo 
cameras serves as input to both loops, which actually are just two interpretations 
of one and the same calculation. As a result, the servoing no longer generates a 
linear image motion, but a “Cartesian” motion in three-space. In order to apply
the results of the previous sections directly, the projective kinematic model has
to be generalized to account for varying robot configurations.



5.1 Generalized Projective Kinematics of a Moving Robot 



The kinematic model presented so far is only valid around the zero of the robot.
As the robot moves so do its joints, and their operators change accordingly.
Hence, the generalized projective kinematics will consist of operators $\hat{H}_j|_Q$
which are expressed as functions of the current configuration $Q$ of the robot,
and it will refer to the joint-space shifted by $q(t) - Q$. This is well-known in
robotics and is commonly utilized in Cartesian velocity control. Here, the
formulation has to be extended to the projective model, where the arguments of
the respective proofs are essentially based on the properties of conjugate forms.
The equations are stated in (25) and are intuitively explained as follows: for each
joint $j$, first its own displacement, expressed by the truncated forward kinematics
$H_j = H_j(Q)$, must be undone, then the initial operator $\hat{H}_j$ is applied as
beforehand, and after that the joint must return to $Q$. The Jacobian for the
current position $M(t)$ of a projective point equally uses the current operator
values:

$$\hat{H}_j|_Q = H_j\, \hat{H}_j\, H_j^{-1}, \qquad J_H|_Q = \left[ \hat{H}_1|_Q M(t), \ldots, \hat{H}_6|_Q M(t) \right]_{4\times 6} \tag{25}$$
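
The conjugation in (25) can be checked against finite differences of the forward kinematics at a non-zero configuration. The sketch below is an illustration under assumptions: the six operators and `H_PE` are made up, the helper names are hypothetical, and the truncated forward kinematics is taken over the joints preceding joint $j$ (including joint $j$ itself would give the same operator, since $\exp(Q_j \hat{H}_j)$ commutes with $\hat{H}_j$).

```python
import numpy as np

def expm_series(A, terms=40):
    """Matrix exponential by a truncated Taylor series (fine for small 4x4 operators)."""
    result, term = np.eye(len(A)), np.eye(len(A))
    for k in range(1, terms):
        term = term @ A / k
        result = result + term
    return result

# Hypothetical stand-in for the unknown calibration H_PE (an assumption).
H_PE = np.array([[1.0, 0.1, 0.0, 0.2],
                 [0.0, 1.0, 0.1, 0.0],
                 [0.0, 0.0, 1.0, 0.1],
                 [0.05, 0.02, 0.01, 1.0]])

def random_operator(rng):
    """Made-up zero-reference joint operator: a Euclidean twist conjugated by H_PE."""
    w = rng.normal(size=3)
    w /= np.linalg.norm(w)
    T = np.zeros((4, 4))
    T[:3, :3] = np.array([[0, -w[2], w[1]], [w[2], 0, -w[0]], [-w[1], w[0], 0]])
    T[:3, 3] = rng.normal(size=3)
    return np.linalg.inv(H_PE) @ T @ H_PE

def forward_kinematics(operators, q, upto=None):
    """exp(q_1 H_1) ... for the first `upto` joints (all joints if `upto` is None)."""
    H = np.eye(4)
    for H_hat, q_j in list(zip(operators, q))[:upto]:
        H = H @ expm_series(q_j * H_hat)
    return H

def operators_at(operators, Q):
    """Generalized operators H_hat_j|Q = H_j(Q) H_hat_j H_j(Q)^-1, cf. (25)."""
    result = []
    for j, H_hat in enumerate(operators):
        H_j = forward_kinematics(operators, Q, upto=j)   # joints before joint j
        result.append(H_j @ H_hat @ np.linalg.inv(H_j))
    return result

def point_jacobian(operators, Q, M_t):
    """4x6 Jacobian of the current point position M(t) at configuration Q."""
    return np.column_stack([H @ M_t for H in operators_at(operators, Q)])

rng = np.random.default_rng(1)
operators = [random_operator(rng) for _ in range(6)]
Q = np.array([0.3, -0.2, 0.1, 0.25, -0.15, 0.2])   # arbitrary current configuration
M0 = np.array([0.2, -0.1, 0.4, 1.0])               # point at the robot's zero
M_t = forward_kinematics(operators, Q) @ M0        # its current position
J_Q = point_jacobian(operators, Q, M_t)
```

At $Q = 0$ the generalized operators reduce to the zero-reference ones, so the same routine also reproduces the Jacobian (12).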



To summarize, we have a general model for the projective kinematics and the 
Jacobian of a projective point in form of an analytical expression in Q, i.e. in 
configuration space. It is a sound linear model of the instantaneous interaction 
between joint-space and projective space. Consequently, as long as this general
model is applied in sections 3 and 4, the resulting joint-space motions $q$ are valid
and can be used for a direct control to be calculated.
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5.2 Direct Control 

In visual servoing, a task is commonly represented in terms of a target image, e.g.
by a number of image points or equivalently by their 3D-reconstruction² $A_i^*$
in case of stereo. As soon as the current positions $A_i(t)$ overlap the target,
the task has been achieved. Classically, the control is computed by means of
the inverse Jacobian (12) applied to an error-vector $A_i(t) - A_i^*$, in the
point-coordinates or in their image-coordinates $s_i(t) - s_i^*$ (13), respectively. As
a result of this local linear approximation, the convergence and stability depend
strongly on the conditioning of the Jacobian matrix.

In our approach, the task is extended to guided motion towards the target.
On the one hand, the constraints (section 4.1) on the motion in the current
positions $A_i(t)$ assure that the trajectories emerge as desired. On the other
hand, the projective kinematic model expressed for the robot’s current
configuration allows for direct calculation of the control from the constrained
solutions $q_t$, $q_r$, $q_p$. Beyond that, they give rise to a “distance-to-target” along the
trajectory as well as a corresponding 3-dof Cartesian feed-back error $(\tau, \theta_r, \theta_p)$.
Therefore, the direct control can be calculated as the gain-weighted sum (26).

$$e = (\tau, \theta_r, \theta_p)^\top, \qquad -\dot{e} = \mathrm{diag}(\lambda_t, \lambda_r, \lambda_p)\, e, \qquad \dot{q} = -[q_t\ q_r\ q_p]\, \dot{e}, \tag{26}$$

$$H_e(e) \approx \exp(\lambda_t \tau \hat{H}_T + \lambda_r \theta_r \hat{H}_R + \lambda_p \theta_p \hat{H}_P), \quad \text{for } \theta_r, \theta_p \text{ small.} \tag{27}$$

However, this version “directTHREE” of the direct control is valid only for the
gains being small, or for the control being recalculated at high frequencies. There
is a systematic “integration-error” between the trajectories $H_e$ (27) as they are
controlled and $H_d$ (24) as they are desired. Nevertheless, the experiments show that
directTHREE already allows for directly servoing a complicated reaching task
without the deviations becoming too strong.

By construction, the feed-forward is such that the center is undergoing a pure
translation, and that the face is undergoing a pure rotation $H_s = \exp(\theta_s \hat{H}_s)$
around the center (section 4.1). This part of the direct control is valid, since
the summed operators can be shown to integrate as desired (28). However, the
operators of the two rotations $\hat{H}_R$, $\hat{H}_P$ integrate differently than their sum does
(29). Therefore, a sound formulation “directTWO” of the direct control will
be derived that consists of a single effective rotation and that controls only a
2-dof feed-back error $(\tau, \theta_s)^\top$.

$$\exp(\lambda_t \tau \hat{H}_T)\, \exp(\lambda_s \theta_s \hat{H}_s) = \exp(\lambda_t \tau \hat{H}_T + \lambda_s \theta_s \hat{H}_s), \quad \text{but} \tag{28}$$

$$\exp(\lambda_r \theta_r \hat{H}_R)\, \exp(\lambda_p \theta_p \hat{H}_P) \neq \exp(\lambda_r \theta_r \hat{H}_R + \lambda_p \theta_p \hat{H}_P) \tag{29}$$

$$e = (\tau, \theta_s)^\top, \qquad -\dot{e} = \mathrm{diag}(\lambda_t, \lambda_s)\, e, \qquad \dot{q} = -[q_t\ q_s]\, \dot{e} \tag{30}$$

In order to calculate $\hat{H}_s$ and $q_s$, we make use of the fact
that both $\hat{H}_R$ and $\hat{H}_P$ are rotations about the center, i.e. they are elements of
the respective matrix representation of $so(3)$. The solution is provided by the
Campbell-Baker-Hausdorff formula known in Lie-group theory. It has in
case of $so(3)$ an algebraic solution in closed form. For two given operators, it
relates their product-of-exponentials to the exponential of an (infinite) sum of
higher order Lie brackets $[\hat{H}_R, \hat{H}_P]$³ of the operators. In contrast to work where a
truncated approximation of such a sum is used, the closed form solution (31), (32)
can be found in our case. Thanks to the operators being just conjugate forms
of $so(3)$, this solution can be calculated directly from the projective operators,
as sketched below:

² Again, we allow for the general case of a projective reconstruction.



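$$a = \sin\tfrac{\theta_r}{2} \cos\tfrac{\theta_p}{2}, \qquad b = \cos\tfrac{\theta_r}{2} \sin\tfrac{\theta_p}{2}, \qquad c = \sin\tfrac{\theta_r}{2} \sin\tfrac{\theta_p}{2}, \tag{31}$$

$$\hat{H}_s = \left( \sin\tfrac{\theta_s}{2} \right)^{-1} \left( a\, \hat{H}_R + b\, \hat{H}_P + c\, [\hat{H}_R, \hat{H}_P] \right) \tag{32}$$

Note additionally that only $a$ and $b$ in (31) have a cosine term, so the first-order
approximation in (27) is valid.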



6 Experiments 

In this section, we validate and evaluate the above theoretical results on a
classical benchmark test: a rotation of 180° around the optical axis, or the stereo
rig’s roll axis in our case. This configuration is known to be a degenerate
one in the monocular case. Additionally, a potential self-occlusion is enforced
by the face being oriented transversally with respect to the image planes.
Besides that, the dimensions correspond to those of our experimental system
and the projective kinematic data has been taken from a recent self-calibration
experiment.

First, three classical stereo servoing laws are tried (Fig. 0: pseudo-inverse 
of the stacked Jacobians |7|, their block- wise pseudo-inverse 0, and a straight- 
forward servoing for plain 3D points in Euclidean three-space (like (UTil) but 
in space). The second one, which basically sums two independent monocular 
controls, diverges while moving towards infinity. The other two laws run into the 
self-occlusion while more or less translating towards the target, but get trapped
in the local minimum. Both manage to escape slowly due to some accidental 
perturbation, but this is unpredictable. Then, they turn the face almost in-place, 
again through an occlusion, before finally converging. 

Second, the trajectory generation described earlier is tested. Figure 6 shows the
solutions found using (23), where all the $\mu(\sigma)$ are chosen linear. In Fig. 7 a
rather steep $\mu_t$ is chosen to favor the translation first. The self-occlusion has
been avoided successfully, as evident in the figures which are rendered from a
central view-point close to the stereo one. Besides this illustrative example, the
control experiments establish a thorough validation of reliability as well as an
extensive evaluation of the precision of the trajectories, since

Footnote 3: $[H_r, H_p] = H_r H_p - H_p H_r = H_{PE}^{-1}(T_r T_p - T_p T_r)H_{PE} = H_{PE}^{-1}[T_r, T_p]H_{PE}$, with $T_{r,p}$
having the classical anti-symmetric form of so(3) as upper 3×3 block.
each iteration of the direct control can be interpreted as a newly generated
feed-forward trajectory.



Fig. 6. Trajectories: visibility is preserved.

Fig. 7. Trajectories: early translation.



Third, the two control laws are compared with respect to the bias intrinsic
to directTHREE. Figure 8 shows this as a very small deviation of their
image trajectories. More remarkable is the center point’s deviation from the
desired straight-line trajectory. This deviation is also found in Figure 9, but it
vanishes as the gain decreases. A first conjecture is that this deviation is
due to a linearization error arising when the integration of Cartesian velocities H
is desired [12,31] but joint-velocities q are actually driven, the two being only very
locally in exact correspondence. This is confirmed by the innermost trajectory,
for which the joint-velocities were limited to 5° in order to limit this cause of
deviations.

Fig. 8. directTHREE versus directTWO.

Fig. 9. Various gains + joint-speed limit.

Fourth, both control-errors (26), (30) are confirmed to have an exponential
convergence rate (Figs. 10, 11). In the case of directTHREE, we compare the
Cartesian-error $(\tau, \theta_r, \theta_p)$ as calculated in our projective control scheme with
Euclidean ground-truth. The angular errors do strictly overlap, whereas the decay
of the projective translation error $\tau$ seems much steeper. However, this
difference is only an apparent one caused by the unknown scale $\rho$. In fact, it is
absorbed by a reciprocal scaling of $q_t$, such that the performance and behavior
of the control remain unaffected by this ambiguity. In Figure 10 the results
of the directTWO law are compared, once with and once without the
joint-velocity limit. The curve of $\theta_s$ clearly reflects the task’s overall rotation of 180°,
which beforehand was spread among the two rotational motions in $\theta_r$, $\theta_p$.



Fig. 10. directTWO: control-error.

Fig. 11. directTHREE: control-error.




Fifth, Figure 12 shows the error in the markers’ image coordinates. It clearly
no longer has an exponential decay, not even a monotonic one. The zero-line
actually reflects the center’s straight horizontal trajectory. Finally, the corresponding
trajectories in joint-space are given in Figure 13, once without and
once with the 5° limit. Apparently it is the initially high velocities of q\ and 
which are the cause for the above mentioned drift away from the straight line. 

Fig. 12. Image errors.

Fig. 13. Joint-space motion.

7 Discussion

In this paper we described a new method for robot visual guidance based on
non-metric representations of both the stereo system and the robot’s kinematics. The
paper is built on top of two main bodies of work: (i) the well-known framework
for representing 3-D visual information in projective space and (ii) the motion
representation of rigid bodies and articulated mechanisms using 3-D projective
transformations. The latter was recently introduced by the authors.

Traditionally, visual robot control used Euclidean pose to estimate an image-to-robot
Jacobian together with a Euclidean kinematic model to transform desired
Cartesian velocities into robot joint-velocities. These are derived from image
data and the inverse of the image-Jacobian. Here, we went directly “from the
image to the joints” using a sound projective model for robot motions as seen
by an uncalibrated stereo rig. The advantage over the Euclidean approach is 
that exact knowledge of the robot’s mechanics is not required. Moreover, the 
projective models can be estimated quite precisely on-line and on-site simply by 
observing the elementary joint motions with a stereo rig. 

Beyond that, we studied in detail the general task of reaching B starting from
A, where locations A and B are described by their images. We formulated the
decomposition of such a task into three elementary motions which satisfy several
constraints: the features must be visible in the images all along the trajectory 
and the motion must be feasible by the manipulator. We showed how to design 
such trajectories and how to drive them efficiently in practice. 

The method was validated and evaluated on a classical benchmark test for 
visual servoing, namely a 180° turn of the end-effector, at which most existing
techniques based only on image-error measurements fail.

Our work extends the state-of-the-art in visual servoing from calibrated or 
poorly calibrated cameras to uncalibrated stereo rigs, where the robot motion 
and kinematics as well as the reaching trajectory are represented by projective 
transformations. We believe that the latter is a promising framework for de- 
scribing articulated mechanisms and their associated constrained motions from 
image observations alone and without any prior knowledge about the geometric 
configuration at hand. 
Abstract. We propose a general unsupervised multiscale approach towards
image segmentation. The novelty of our method is based on the
following points: firstly, it is general in the sense of being independent 
of the feature extraction process; secondly, it is unsupervised in that 
the number of classes is not assumed to be known a priori; thirdly, it 
is flexible as the decomposition sensitivity can be robustly adjusted to 
produce segmentations into varying number of classes and fourthly, it is 
robust through the use of the mean shift clustering and Bayesian multis- 
cale processing. Clusters in the joint spatio-feature domain are assumed 
to be properties of underlying classes, the recovery of which is achie- 
ved by the use of the mean shift procedure, a robust non-parametric 
decomposition method. The subsequent classification procedure consists 
of Bayesian multiscale processing which models the inherent uncertainty 
in the joint specification of class and position via a Multiscale Random 
Field model which forms a Markov Chain in scale. At every scale, the 
segmentation map and model parameters are determined by sampling 
from their conditional posterior distributions using Markov Chain Monte 
Carlo simulations with stochastic relaxation. The method is then applied 
to perform both colour and texture segmentation. Experimental results 
show the proposed method performs well even for complicated images. 



1 Introduction 

The segmentation of an image into an unknown number of distinct and in some 
way homogeneous regions is a difficult problem and remains a fundamental issue 
in low-level image analysis. Many different methodologies have been proposed but
a process that is highly unsupervised, flexible and robust has yet to be realised. 

In this paper, we propose a general unsupervised multiscale approach towards 
image segmentation. The strength of our method is based on the following points: 
(i) it is general in the sense of being independent of the feature extraction pro- 
cess; consequently, the algorithm can be applied to perform different types of 
segmentation without modification, be it grey-scale, texture, colour based etc. 
(ii) it is unsupervised in that the number of classes is not assumed to be known
a priori; (iii) it is flexible as the decomposition sensitivity can be robustly adjusted
to produce segmentations into varying numbers of classes; (iv) it is robust
through the use of mean shift clustering and Bayesian multiscale processing;
(v) dramatic speed-ups of computation can be achieved using an appropriate
processor architecture, as most parts of the algorithm are highly parallelisable.

The complete algorithm consists of a two-step strategy. Firstly, salient features,
which correspond to clusters in the feature domain, are regarded as manifestations
of classes, the recovery of which is achieved using the mean shift
procedure, a kernel-based decomposition method which can be shown to be
a generalised version of the k-means clustering algorithm [3].

Secondly, upon determining the number of classes and the properties of each 
class, we proceed towards the problem of classification. Unfortunately, classifica- 
tion in the image segmentation context is afflicted by uncertainties which render 
most simple techniques ineffective. To be more certain of the class of a pixel 
requires averaging over a larger area, which unfortunately makes the location 
of the boundary less certain. In other words, localisation in class space conflicts
directly with simultaneous localisation in position space. This has been rigorously
shown by Wilson and Spann to be a consequence of the relationship
between the signals of which images are composed and the symbolic descriptions,
in terms of classes and properties, which are the output of the segmentation
process. These effects of uncertainties can however be minimised by the use of 
representations employing multiple scales. 

Motivated by this rationale, we adopted a Bayesian multiscale classification 
paradigm by modelling the inherent uncertainty in the joint specification of class 
and position via the Multiscale Random Field model. This approach provides
context for the classification at coarser scales before achieving accurate boundary 
tracking at finer resolutions. 



2 The Mean Shift Procedure 

The mapping of real images to feature spaces often produces a very complex 
structure. Salient features whose recovery is necessary for the solution of the seg- 
mentation task, correspond to clusters in this space. As no a priori information is 
typically available, the number of clusters/classes and their shapes/distributions 
have to be discerned from the given image data. 

The uniqueness of image analysis in this clustering context lies in the fact 
that features of neighbouring data points in the spatial domain are strongly 
correlated. This is due to the fact that typical images do not consist of random
points but are manifestations of entities which form contiguous regions
in space. Following this rationale, we represent the image to be segmented in an
n-dimensional feature space. Position and feature vectors are then concatenated
to obtain a joint spatio-feature domain of dimension d = n + 2. Our approach
thus includes the crucial spatial locality information typically missing from most
clustering approaches to image segmentation. Each feature is then normalised
by dividing it by its standard deviation to eliminate bias due to scaling.
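
The construction of the joint domain can be illustrated with a minimal sketch. The function name `joint_domain` and the nested-list image layout are hypothetical, and the sketch assumes that the two position coordinates are scaled in the same way as the n features, which the text does not state explicitly.

```python
import math

def joint_domain(image):
    """Build (row, col, f_1..f_n) vectors and divide each dimension by its std.

    image: list of rows; each pixel is a tuple of n feature values.
    Assumption: spatial coordinates are normalised like the features.
    """
    vectors = [(float(r), float(c)) + tuple(float(v) for v in pixel)
               for r, row in enumerate(image)
               for c, pixel in enumerate(row)]
    dims = len(vectors[0])          # d = n + 2
    count = len(vectors)
    scaled = [list(v) for v in vectors]
    for k in range(dims):
        mean = sum(v[k] for v in vectors) / count
        var = sum((v[k] - mean) ** 2 for v in vectors) / count
        std = math.sqrt(var)
        if std > 0.0:               # leave constant dimensions untouched
            for v in scaled:
                v[k] /= std
    return scaled
```

For a grey-level image n = 1 and each pixel yields a 3-dimensional vector; for a three-channel colour image the vectors are 5-dimensional.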

This joint spatio-feature domain can be regarded as samples drawn from an 
unknown probability distribution function. If the distribution is represented with 
a parametric model (e.g. Gaussian mixture), severe artifacts may be introduced 
as the shape of delineated clusters is constrained. Non-parametric cluster analysis 
however, uses the modes of the underlying probability density to define cluster 
centres and the valleys in the density to define boundaries separating the clusters. 

Kernel estimation is a good practical choice for non-parametric clustering
techniques as it is simple and, for kernels obeying mild conditions, the estimation
is asymptotically unbiased, consistent in a mean-square sense and uniformly
consistent in probability. Furthermore, for unsupervised segmentation, where
flexibility and interpretation are of utmost importance, any rigid inference of
an ‘optimal’ number of clusters may not be productive. By using a kernel-based
density estimation approach and controlling the kernel size, a method is developed
which is capable of decomposing an image into a number of classes which
corresponds well to a useful partitioning for the application at hand. Alternatively,
we can produce a set of segmentations for the image (corresponding to
different numbers of classes), each one reflecting the decomposition of the
image at a different feature resolution.



2.1 Density Gradient and the Mean Shift Vector 

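Let $\{x_i\}_{i=1,\ldots,N}$ be the set of N image vectors in the d-dimensional Euclidean
space $R^d$. The multivariate kernel density estimate obtained with kernel K(x)
and window radius h, computed at point x, is defined as:

$$\hat f(x) = \frac{1}{N h^d} \sum_{i=1}^{N} K\left(\frac{x - x_i}{h}\right)$$ (1)

The use of a differentiable kernel allows us to define the estimate of the density
gradient as the gradient of the kernel density estimate (1):

$$\hat\nabla f(x) \equiv \nabla \hat f(x) = \frac{1}{N h^d} \sum_{i=1}^{N} \nabla K\left(\frac{x - x_i}{h}\right)$$ (2)

The Epanechnikov kernel, given by:

$$K_E(x) = \begin{cases} \frac{1}{2} c_d^{-1} (d + 2)(1 - x^T x) & \text{if } x^T x < 1 \\ 0 & \text{otherwise} \end{cases}$$ (3)

where $c_d$ is the volume of the unit d-dimensional sphere, has been shown to be
the simplest kernel to possess properties of asymptotic unbiasedness, mean-square
and uniform consistency for the density gradient
estimate. In this case, the density gradient estimate becomes:

$$\hat\nabla f_E(x) = \frac{N_x}{N (h^d c_d)} \, \frac{d+2}{h^2} \left( \frac{1}{N_x} \sum_{x_i \in S_h(x)} (x_i - x) \right)$$ (4)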
where the region $S_h(x)$ is a hypersphere (uniform kernel) of radius h centred on
x, having the volume $h^d c_d$ and containing $N_x$ data points. The last term in (4):

$$M_h(x) = \frac{1}{N_x} \sum_{x_i \in S_h(x)} (x_i - x)$$ (5)

is called the sample mean shift. The quantity $\frac{N_x}{N (h^d c_d)}$ is the kernel density
estimate computed with the uniform kernel $S_h(x)$, $\hat f_U(x)$, and thus we can write (4)
as:

$$\hat\nabla f_E(x) = \hat f_U(x) \, \frac{d+2}{h^2} \, M_h(x)$$ (6)

which yields:

$$M_h(x) = \frac{h^2}{d+2} \, \frac{\hat\nabla f_E(x)}{\hat f_U(x)}$$ (7)

Equation (7) depicts the mean shift vector as a normalised density gradient
estimate. This implies that the vector always points towards the direction of the
maximum increase in density and hence it can define a path leading to a local
density maximum. The normalised gradient in (7) also brings about a desirable
adaptive behaviour, with the mean shift step being large in low density regions
and decreasing as x approaches a mode.
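
The sample mean shift and its repeated application can be sketched as follows. This is a minimal illustration under simple assumptions: points are plain Python lists, the window is the hypersphere of radius h, and the names `mean_shift_vector` and `seek_mode` as well as the stopping tolerance are hypothetical choices, not part of the paper.

```python
import math

def mean_shift_vector(x, points, h):
    """Sample mean shift M_h(x): mean offset of the points inside S_h(x)."""
    inside = [p for p in points if math.dist(p, x) <= h]
    if not inside:
        return [0.0] * len(x)
    return [sum(p[k] for p in inside) / len(inside) - x[k]
            for k in range(len(x))]

def seek_mode(x, points, h, tol=1e-6, max_iter=500):
    """Translate the window by M_h(x) until the shift becomes negligible."""
    x = list(x)
    for _ in range(max_iter):
        shift = mean_shift_vector(x, points, h)
        x = [a + b for a, b in zip(x, shift)]
        if math.sqrt(sum(s * s for s in shift)) < tol:
            break
    return x
```

Starting `seek_mode` from every data point and grouping the points by where they converge gives the clustering described in the next subsection.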





Fig. 1. On the left: Consider the density estimation plot (in blue) of a hypothetical 
1-D feature. The gradient or derivative of the density plot is shown in red. It is obvious 
that the density gradient always points in the direction of maximum increase in density 
(bear in mind that left-to-right along the 1-D axis constitutes positive movement). On 
the right: As the mean shift vector is proportional to the density gradient estimate, 
successive computations of the mean shift define a path leading to a local density 
maximum (shown here for a 2-D feature) 



While it is true that the mean shift vector $M_h(x)$ has the direction of the
gradient estimate at x, it is not apparent that the density estimate at the successive
locations of the mean shift procedure is a monotonically increasing sequence.
The following theorem, however, assures the convergence:

Theorem. Let $\hat f_E = \{ \hat f_k(Y_k, K_E) \}_{k=1,2,\ldots}$ be the sequence of density
estimates obtained using the Epanechnikov kernel and computed at the
points $\{Y_k\}_{k=1,2,\ldots}$ defined by the successive locations of the mean shift
procedure with a uniform kernel. The sequence is convergent.

Proof of this theorem can be found in g]. 



2.2 Mean Shift Clustering Algorithm 

The mean shift clustering algorithm consists of successive computation of the 
mean shift vector, Mh(x) and translation of the window S'h(x) by Mh(x). Each 
data point thus becomes associated with a point of convergence which represents 
a local mode of the density in the d-dimensional space. Iterations of the procedure
thus give rise to a ‘natural’ clustering of the image data, based solely on their
mean shift trajectories. 

The procedure, in its original form, is meant to be applied to each point in
the data set. This approach is not desirable for practical applications, especially
when the data set is large, as is typical for images. The conventional mean shift
procedure has a complexity of $O(N^2)$ for a set of N data points. A more realistic
approach consists of a probabilistic mean shift algorithm whose
complexity is $O(mN)$, with $m \ll N$, as outlined below:

1. Define a random tessellation of the space with $m \ll N$ hyperspheres $S_h(x)$.
To reduce computational load, a set of m points, called the sample set, is
randomly selected from the data. It is proposed that two simple constraints
be imposed on the sample set: firstly, the distance between any two points in
the sample set should not be smaller than h, the radius of the hypersphere
$S_h(x)$. Secondly, sample points should not lie in sparsely populated regions.
A region is defined as sparsely populated whenever the number of points
inside the hypersphere is below a certain threshold $T_1$. The distance and
density constraints automatically determine the size m of the sample set.
Hyperspheres centred on the sample set cover most of the data points. These
constraints can of course be relaxed if processing time is not a critical issue.

2. The mean shift procedure is applied to the sample set. A set containing m
cluster centre candidates is defined by the points of convergence of the m
mean shift procedures. As the computation of the mean shift vectors is based 
on almost the entire data set, the quality of the gradient estimate is not 
diminished by the use of sampling. 

3. Perturb the cluster candidates and reapply the mean shift procedure. Since a 
local plateau can prematurely stop the iterations, each cluster centre can- 
didate is perturbed by a random vector of small norm and the mean shift 
procedure is left to converge again. 
4. Derive the cluster centres $Y_1, Y_2, \ldots, Y_p$ from the cluster centre candidates.
Any subset of cluster centre candidates which are less than distance h from
each other defines a cluster centre. The cluster centre is the mean of the
cluster centre candidates in the subset.

5. Validate the cluster centres. Between any two cluster centres $Y_i$ and $Y_j$, a
significant valley should occur in the underlying density. The existence of the
valley is tested for each pair $(Y_i, Y_j)$. The hypersphere $S_h(x)$ is moved with
step h along the line defined by $(Y_i, Y_j)$ and the density is estimated using
the Epanechnikov kernel along the line. Whenever the ratio between
$\min[\hat f(Y_i), \hat f(Y_j)]$ and the minimum density along the line is larger than a
certain threshold, $T_2$, a valley is assumed between $Y_i$ and $Y_j$. If no valley
is found, the cluster centre of lower density ($Y_i$ or $Y_j$) is removed from
the set of cluster centres.
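
Two of the steps above, the selection of the sample set (step 1) and the merging of cluster centre candidates (step 4), can be sketched as follows. This is only an illustration under stated assumptions: the function names are hypothetical, the neighbour count is done by brute force, and the grouping in `merge_candidates` is a simple greedy pass rather than an exhaustive search for subsets.

```python
import math
import random

def select_sample_set(points, h, t1, rng):
    """Step 1: pick sample points at least h apart, skipping sparse regions."""
    order = list(range(len(points)))
    rng.shuffle(order)
    sample = []
    for i in order:
        p = points[i]
        population = sum(1 for q in points if math.dist(p, q) <= h)
        if population < t1:
            continue                      # density constraint (threshold T1)
        if all(math.dist(p, s) >= h for s in sample):
            sample.append(p)              # distance constraint
    return sample

def merge_candidates(candidates, h):
    """Step 4: group candidates closer than h and average each group."""
    groups = []
    for c in candidates:
        for g in groups:
            if any(math.dist(c, member) < h for member in g):
                g.append(c)
                break
        else:
            groups.append([c])
    return [[sum(m[k] for m in g) / len(g) for k in range(len(g[0]))]
            for g in groups]
```

The size of the returned sample set is not chosen in advance; as in the text, it follows from the two constraints.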



The clustering algorithm makes use of three parameters: the kernel radius
h, which controls the sensitivity of the decomposition, the threshold $T_1$, which
imposes the density constraint on the sample set, and $T_2$, corresponding to the
minimum acceptable peak-valley ratio. The parameters $T_1$ and $T_2$ generally have
a weak influence on the final results. In fact, all our experimental results, as performed
on 256×256 resolution images, were obtained by fixing $T_1 = 50$ and
$T_2 = 1.2$. As the final objective of a segmentation is often application specific,
top-down a priori information controls the kernel radius h, resulting in data
points having trajectories that merge into an appropriate number of classes.
Alternatively, the ‘optimal’ radius can be obtained as the centre of the largest
operating range which yields the same number of classes. Finally, cluster centres
which are sufficiently close (less than h apart) in the n-dimensional
‘feature-only’ space (recall that n = d − 2) are merged in order to group similar
features which are spatially distributed.
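
The ‘largest operating range’ rule for choosing h can be sketched as below, assuming the clustering has already been run for an increasing list of radii. The function name `optimal_radius` and the numbers in the usage example are hypothetical and are not results from the paper.

```python
def optimal_radius(radii, class_counts):
    """Centre of the widest run of consecutive radii giving the same class count.

    radii: increasing kernel radii that were tried.
    class_counts: number of classes found for each radius.
    """
    best = (0.0, radii[0], radii[0])      # (width, start, end)
    start = 0
    for i in range(1, len(radii) + 1):
        if i == len(radii) or class_counts[i] != class_counts[start]:
            width = radii[i - 1] - radii[start]
            if width > best[0]:
                best = (width, radii[start], radii[i - 1])
            start = i
    return 0.5 * (best[1] + best[2])

# Hypothetical sweep: the class count is stable between h = 0.4 and h = 0.7.
h_star = optimal_radius([0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8],
                        [12, 9, 6, 6, 6, 6, 4])
```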




Fig. 2. Flexibility of mean shift clustering in determining the number of classes. From
left: Image of ‘house’ and its corresponding segmentations using h = 0.2 (47 classes),
h = 0.4 (15 classes) and h = 0.8 (8 classes) in the 5-dimensional normalised Euclidean
space. The classification strategy is implemented using techniques detailed in Sects. 3
and 4



We shall assume these validated cluster centres to be manifestations of underlying
class properties for our image segmentation task, with each class thus
represented by an n-dimensional feature vector. We then proceed with a multiscale
Bayesian classification algorithm outlined below. A Bayesian approach
is used because the notion of likelihood can be determined naturally from the 
computation of dissimilarity measures between feature vectors. Moreover, priors 
can be effectively used to represent information regarding segmentation results 
of coarser scales when segmentation is being performed for finer resolutions. 

3 The Multiscale Random Field Model 

A multiscale Bayesian classification approach is implemented using the Multiscale
Random Field (MSRF) model. In this model, let the random field Y
be the image that must be segmented into regions of distinct statistical behaviour.
The behaviour of each observed pixel is dependent on a corresponding
unobserved class in X. The dependence of observed pixels on their class is specified
through the probability $p(Y = y \mid X = x)$, or the likelihood function. Prior
knowledge about the size and shapes of regions will be modelled by the prior
distribution $p(X)$.

X is modelled by a pyramid structure multiscale random field. $X^{(0)}$ is assumed
to be the finest scale random field with each site corresponding to a single
image pixel. Each site at the next coarser scale, $X^{(1)}$, corresponds to a group of
four sites in $X^{(0)}$, and the same goes for coarser scales upwards. Thus, the multiscale
classification is denoted by the set of random fields $X^{(n)}$, n = 0, 1, 2, ....

The main assumption made is that the random fields form a Markov Chain 
from coarse to fine scale, that is: 

$$p(X^{(n)} = x^{(n)} \mid X^{(l)} = x^{(l)}, \, l > n) = p(X^{(n)} = x^{(n)} \mid X^{(n+1)} = x^{(n+1)})$$ (8)

In other words, it is assumed that for $X^{(n)}$, $X^{(n+1)}$ contains all relevant information
from previous coarser scales. We shall further assume that the classification
of sites at a particular scale is dependent only on the classification of a
local neighbourhood at the next coarser scale. This relationship and the chosen
neighbourhood structure are depicted in Fig. 3.

3.1 Sequential Maximum a Posteriori (SMAP) Estimation 

Fig. 3. Blocks 1, 2, 3 and 4 are of scale n and have a common parent at scale n + 1,
i.e. $S_0$, which they are dependent on. The arrows show additional dependence on their
parent’s neighbours: $S_1$, $S_2$ and $S_3$

In order to segment the image Y, one must accurately estimate the site classes
in X. Generally, Bayesian estimators attempt to minimise the average cost of an
erroneous segmentation. This is done by solving the optimisation problem:

$$\hat x = \arg\min_x E[\, C(X, x) \mid Y = y \,]$$ (9)

where C(X, x) is the cost of estimating the ‘true’ segmentation X by the approximate
segmentation x. The choice of functional C is of crucial importance as
it determines the relative importance of errors. Ideally, a desirable cost function
should assign progressively greater cost to segmentations with larger regions of
misclassified pixels. To achieve this goal, the following cost function has been
proposed:



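$$C_{SMAP}(X, x) = \frac{1}{2} + \sum_{n=0}^{L} 2^{n-1} C_n(X, x)$$ (10)

where:

$$C_n(X, x) = 1 - \prod_{l=n}^{L} \delta\left(X^{(l)} - x^{(l)}\right)$$ (11)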
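
The cost in (10) and (11) can be evaluated directly for small examples. The sketch below assumes each segmentation is stored as a list of label lists indexed by scale, from n = 0 (finest) to n = L (coarsest); the name `smap_cost` and this data layout are hypothetical.

```python
def smap_cost(true_pyramid, estimate_pyramid):
    """Evaluate C_SMAP for two label pyramids (index 0 = finest scale)."""
    top = len(true_pyramid) - 1           # coarsest scale L
    cost = 0.5
    for n in range(top + 1):
        # delta product in (11): 1 only if every scale l >= n matches exactly
        all_match = all(true_pyramid[l] == estimate_pyramid[l]
                        for l in range(n, top + 1))
        c_n = 0.0 if all_match else 1.0
        cost += 2.0 ** (n - 1) * c_n
    return cost
```

With this form the cost equals 2^K when K is the coarsest scale containing a misclassified site, and 1/2 when the two pyramids agree everywhere, so errors at coarse scales are penalised more heavily.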



The behaviour of $C_{SMAP}$ is solely a function of the coarsest scale that contains
a misclassified site. The solution is given by:

$$\hat x^{(n)} = \arg\max_{x^{(n)}} \left\{ p(X^{(n)} = x^{(n)} \mid X^{(n+1)} = \hat x^{(n+1)}, Y = y) + \epsilon(x^{(n)}) \right\}$$ (12)

where $\epsilon$ is a second order term which may be bounded by:

$$0 \le \epsilon(x^{(n)}) \le \max_{x^{(n-1)}} p(X^{(n-1)} = x^{(n-1)} \mid X^{(n)} = x^{(n)}, Y = y) \ll 1$$ (13)

Using Bayes rule and ignoring the contribution of $\epsilon$, one obtains the following
equation:

$$\hat x^{(n)} = \begin{cases} \arg\max_{x^{(L)}} \{ p(Y = y \mid X^{(L)} = x^{(L)}) \, p(X^{(L)} = x^{(L)}) \} & \text{for } n = L \\ \arg\max_{x^{(n)}} \{ p(Y = y \mid X^{(n)} = x^{(n)}) \, p(X^{(n)} = x^{(n)} \mid X^{(n+1)} = \hat x^{(n+1)}) \} & \text{for } n < L \end{cases}$$ (14)

where L is the coarsest scale of the multiscale pyramid. The solution is initialised
by determining the maximum a posteriori (MAP) estimate of the coarsest scale
field given the image Y. The MAP segmentation at the next finer scale, $\hat x^{(L-1)}$, is
then found by computing the MAP estimate of $X^{(L-1)}$, given $\hat x^{(L)}$ and the image
Y, hence the name sequential MAP (SMAP) estimator. For our experiments, we
assumed a uniform prior for $X^{(L)}$ but in general, any suitable priors may be
used.
3.2 Likelihood and Prior Probability Functions

We will assume that at a particular scale, the observed sites are conditionally 
independent given their classes: 

$$p(Y = y \mid X^{(n)} = x^{(n)}) = \prod_{s \in S^{(n)}} p(Y_s = y_s \mid X_s^{(n)} = x_s^{(n)})$$ (15)

where the index s denotes individual sites at scale n, $y_s$ represents the ‘averaged’
feature vector of observed site $Y_s$ and $X_s^{(n)}$ corresponds to segmentation classes
which have values taken from $\Lambda = \{1, 2, \ldots, c\}$, where c is the total number of
classes.

The multiscale averaging to generate Ys at each scale is achieved using the 
lowpass subimages of Kingsbury’s complex wavelet decomposition (KCWD)
of each feature component of y. The advantage of KCWD over the more conventional
discrete wavelet transform for multiscale representation of features lies in
the remarkable shift invariance property of the former approach. To illustrate, 
the figure below shows grey-level feature averaging of ‘lenna’ using the lowpass 
subimages of KCWD: 




Fig. 4. Grey-level feature averaging of ‘lenna’ using the lowpass subimages of Kingsbury’s
complex wavelet decomposition at scales (from left) n = 0, 1, 2 and 3 respectively,
with excellent shift invariance



We choose to model $p(Y_s = y_s \mid X_s^{(n)} = x_s^{(n)})$ as a Gaussian distribution:




where $\|\cdot\|$ denotes Euclidean distance. The variance parameter $\sigma_n$ typically
increases with segmentation resolution, which agrees with the increased class
uncertainties at finer scales.
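
A Gaussian likelihood of this kind can be sketched as follows. Since the exact expression is not reproduced here, the sketch assumes an isotropic Gaussian in the Euclidean distance between the averaged feature vector and a class centre, and drops the normalising constant; the names `log_likelihood` and `most_likely_class` are hypothetical.

```python
import math

def log_likelihood(y_s, class_centre, sigma_n):
    """Unnormalised log of an isotropic Gaussian in the feature distance.

    Assumption: the likelihood depends only on ||y_s - centre|| and sigma_n.
    """
    d2 = sum((a - b) ** 2 for a, b in zip(y_s, class_centre))
    return -d2 / (2.0 * sigma_n ** 2)

def most_likely_class(y_s, centres, sigma_n):
    """Index of the class centre with the highest likelihood for y_s."""
    scores = [log_likelihood(y_s, c, sigma_n) for c in centres]
    return max(range(len(centres)), key=scores.__getitem__)
```

A larger `sigma_n` flattens the scores across classes, which matches the remark that class uncertainty grows at finer scales.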

From our assumptions on the label field X, we have:

$$p(X^{(n)} = x^{(n)} \mid X^{(n+1)} = x^{(n+1)}) = \prod_{s \in S^{(n)}} p(X_s^{(n)} = x_s^{(n)} \mid X_{\partial s}^{(n+1)} = x_{\partial s}^{(n+1)})$$ (17)

where $\partial s$ denotes the neighbourhood structure shown in Fig. 3. We choose the
following form for the right-hand-side term above:




where $\delta_{m,n}$ represents the unit delta function. The scale dependent parameter
$\alpha_n \in [0, 1]$ determines the probability that the class of the fine scale site remains
the same as that of one of the coarser scale local neighbourhood. Conversely,
$1 - \alpha_n$ is the probability that a new class will be randomly chosen from the
remaining classes.

3.3 Parameter Estimation 

In order for the method to be adaptive to the segmentation at hand, the MSRF
model parameters have to be estimated at each scale. A Markov Chain Monte
Carlo (MCMC) sampling approach is used in a predetermined sequential scan to
sample the model parameters and the segmentation map from their conditional
distributions in a specific order. The conditional distributions of the segmentation
map and the model parameters are difficult functions to maximise because
they are multimodal and the vast combined parameter spaces are composed of
both continuous and discrete subspaces. The Metropolis-Hastings algorithm
is a robust MCMC optimisation algorithm which is ideally suited to
these types of problem.

The stochastic relaxation process of simulated annealing is used. At initial
high temperatures, the probability of acceptance is very high but it reduces with
the gradual cooling of the annealing temperature to reach the global maximum
at very low temperatures. The first step consists of sampling the class field. The
conditional distribution, from equations (15) and (17), is given by:



where $T_t$ is the annealing temperature at iteration t of the algorithm and the
distributions for the likelihood and prior terms are given by (16) and (18)
respectively.

For the sampling of $\sigma_n$ and $\alpha_n$, the respective conditional distributions are:



p (a(") = = y,a„,a„) cx 



I 



sGS(") 




(19) 



cr„ : p(cr„|A(”) = = y) oc 
with the likelihood terms given by equations (16) and (18) for $\sigma_n$ and $\alpha_n$
respectively. Non-informative or reference priors were used for all experiments.

Our choice of the likelihood and prior probability distributions also makes it
possible for the dissimilarity term of (16) and the delta function terms of (18) to
be calculated for each segmentation class prior to the MCMC sampling procedure.
Therefore, these terms need to be computed only once and not repeatedly
for each iteration of the Metropolis-Hastings algorithm. This greatly decreases
the overall computation time. More importantly, as the computation of conditional
distributions at each site is independent of the others at a particular
iteration, dramatic speed-ups of calculations can be achieved using systems with
highly parallel architecture.

There has been much debate on how convergence might relate to the annealing
schedule used. Theoretically, a logarithmic schedule is guaranteed to
converge in infinite time. In practice, this is not implementable. We have adopted
a linear schedule which produces robust convergence in a relatively short time.
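
A linear annealing schedule combined with a Metropolis-style acceptance test can be sketched as below. This is an illustration only: the start and end temperatures, the function names, and the use of a tempered ratio (p_new / p_old)^(1/T) with a symmetric proposal are assumptions, not details taken from the paper.

```python
import math
import random

def linear_temperature(t, n_iter, t_start=4.0, t_end=0.1):
    """Temperature falling linearly from t_start to t_end over n_iter steps."""
    if n_iter <= 1:
        return t_end
    return t_start + (t_end - t_start) * t / (n_iter - 1)

def accept(log_p_new, log_p_old, temperature, rng):
    """Accept a proposal with probability min(1, (p_new / p_old)^(1 / T))."""
    log_ratio = (log_p_new - log_p_old) / temperature
    return log_ratio >= 0.0 or rng.random() < math.exp(log_ratio)
```

At high temperature the exponent is small and most proposals are accepted; as the temperature falls, proposals that lower the posterior are accepted less and less often.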

We now apply the complete algorithm to perform the challenging tasks of 
colour and texture segmentation. 

4 Colour Segmentation 

Colour correlates with the class identity of an object because pigments form 
part of the appearance of an object and thus provide vital cues for segmentation 
purposes. In this paper, the perceptually uniform CIE L*a*b* space is used to
represent colour features. It is generated by linearly transforming the RGB colour
space to the XYZ colour space followed by a non-linear transformation. The
non-linear transformation is determined by relation to a nominally white object-colour
stimulus which gives the tristimulus values $(X_n, Y_n, Z_n)$. The lightness L*
is given by:

$$L^* = \begin{cases} 116 (Y/Y_n)^{1/3} - 16 & \text{for } Y/Y_n > 0.008856 \\ 903.3 (Y/Y_n) & \text{for } Y/Y_n \le 0.008856 \end{cases}$$ (22)

The values a*, b* are given as follows:

$$a^* = 500 \{ f(X/X_n) - f(Y/Y_n) \}$$ (23)

$$b^* = 200 \{ f(Y/Y_n) - f(Z/Z_n) \}$$ (24)

where:

$$f(t) = \begin{cases} t^{1/3} & \text{for } t > 0.008856 \\ 7.787 t + 16/116 & \text{for } t \le 0.008856 \end{cases}$$ (25)



The distance between two colours as evaluated in L*a*b* space is simply the 
Euclidean distance between them: 



$$\Delta E_{L^*a^*b^*} = \sqrt{(\Delta L^*)^2 + (\Delta a^*)^2 + (\Delta b^*)^2} \tag{26}$$

The CIE L*a*b* space is as close to being perceptually linear as any colour space is
expected to get. Thus the distance measure in (26) effectively quantifies the
perceived difference between colours.
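
As an illustration of the lightness, chromaticity and distance formulas above, the following minimal sketch converts XYZ tristimulus values to L*a*b* and measures the Euclidean colour difference. The function names and the default white point (a D65-like reference) are assumptions made for this example; the paper does not state which white point or RGB-to-XYZ matrix was used, so the linear RGB-to-XYZ step is left out.

```python
import math

THRESHOLD = 0.008856  # CIE threshold appearing in the lightness and f(t) formulas


def f(t):
    """Non-linear mapping used for a* and b*."""
    return t ** (1.0 / 3.0) if t > THRESHOLD else 7.787 * t + 16.0 / 116.0


def xyz_to_lab(X, Y, Z, white=(0.95047, 1.0, 1.08883)):
    """Map XYZ to (L*, a*, b*). The default white point is an assumption."""
    Xn, Yn, Zn = white
    yr = Y / Yn
    L = 116.0 * yr ** (1.0 / 3.0) - 16.0 if yr > THRESHOLD else 903.3 * yr
    a = 500.0 * (f(X / Xn) - f(yr))
    b = 200.0 * (f(yr) - f(Z / Zn))
    return L, a, b


def delta_e(lab1, lab2):
    """Euclidean distance between two colours in L*a*b* space."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(lab1, lab2)))
```

By construction the reference white maps to L* = 100 with a* = b* = 0, which gives a quick sanity check for any chosen white point.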

Figures 5 and 6 show some typical colour segmentation results using our
algorithm. To determine the number of classes, mean shift clustering using h = 0.7
(in the normalised 5-dimensional Euclidean space) was used for all experiments
to demonstrate that the kernel radius h is a robust parameter that does not
require tedious ‘trial-and-error’ tinkering to achieve the desired results for each image.




Fig. 5. First row: The ‘hand’ image and the three classes segmented by the algorithm. 
Second row: Segmentation results shown at every intermediate scale corresponding to 
(from left) n = 4, 3, 2, 1 and 0 respectively 



Segmentation of the ‘hand’ image shown in Figure 5 shows that the algorithm
easily distinguishes the human hand and the blue doughnut-like
object from the textured background. As shown by the segmentation result at
each scale, processing at coarse scales gives context to the segmentation, on the
basis of which processing at finer resolutions achieves accurate boundary refinement.

Figure 6 shows more colour segmentation results. For the ‘jet’ image, the toy
jet-plane, its shadow and the background are picked out by the algorithm despite
the considerable colour variability of each object. Segmentation of the ‘parrot’
image reveals a fairly smooth partitioning with all major colours bounded by
reasonably accurate boundaries. The algorithm also produces a meaningful
segmentation of the ‘house’ image with the sky, walls, window frames, lawn and
trees/hedges isolated as separate entities. The ‘fox’ image poses a tricky problem
with its shadows and highlights but the algorithm still performs reasonably well
in isolating the fox from the background, although there is inevitable
misclassification at the extreme light and dark regions of the fox due to the L*a*b*
features used. Generally, for all the images, the well-defined region contours reflect
the excellent boundary tracking ability of the algorithm while smooth regions of
homogeneous behaviour are the result of the multiscale processing.

Fig. 6. First row: The ‘jet’ and ‘parrot’ images and their corresponding segmentations.
Second row: The ‘house’ and ‘fox’ images and their corresponding segmentations



5 Texture Segmentation 

Figure 7 illustrates a texture feature extraction model. Basically,
$x(m,n)$ is the input texture image which is filtered by $h(k,l)$, a frequency and
orientation selective filter, the output of which passes through a local energy function
(consisting of a non-linear operator, $f(\cdot)$, and a smoothing operator, $w(k,l)$) to
produce the final feature image, $v(m,n)$. The purpose of the filter,
$h(k,l)$, is the extraction of spatial frequencies (of a particular scale and orientation)
where one or more textures have high signal energy and the others have low
energy. A quadrature mirror wavelet filter bank, used in an undecimated version
of an adaptive tree-structured decomposition scheme [2], performs this task for
our experiments on textures.

Numerous non-linearity operators, $f(\cdot)$, have been applied in the literature,
the most popular being the magnitude, $|x|$, the squaring, $(x)^2$, and the rectified
sigmoid, $|\tanh(\alpha x)|$. Squaring in conjunction with the logarithm after the
smoothing has been found to be the best operator pair for unsupervised
segmentation from a set of tested operator pairs HH. For this reason, this operator
pair is used for our experiments.

[Figure: image, followed by a scale and orientation selective filter, a non-linear
operator and a smoothing operator; the last two blocks form the local energy function]

Fig. 7. Block diagram of the texture feature extraction model

Several smoothing filters are possible for $w(k,l)$ and the Gaussian lowpass
filter is one candidate. The Gaussian lowpass filter has joint optimum resolution
in the spatial and spatial frequency domains, with its impulse response given by:

$$w_G(k,l) = \frac{1}{2\pi\sigma_s^2}\exp\left\{-\frac{k^2+l^2}{2\sigma_s^2}\right\} \tag{27}$$

If we want to estimate the local energy of a signal with low spatial frequency, the
smoothing filter must have a larger region-of-support and vice versa. Hence, the
smoothing filter size may be set to be a function of the band centre frequency,
$f_0$. With $f_0$ normalised ($-\frac{1}{2} \le f_0 \le \frac{1}{2}$), it has been suggested |B| that:

$$\sigma_s = \frac{1}{2\sqrt{2}\,|f_0|} \tag{28}$$



This smoothing filter is also scaled so as to produce unity gain in order for the 
mean of the filter’s output to be identical to that of its input. 
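
The relation between band centre frequency and smoothing width, together with the unity-gain scaling, can be sketched as follows. This is only an illustration: the function name, the truncation of the kernel at three standard deviations and the plain nested-list representation are assumptions, not details taken from the paper.

```python
import math


def gaussian_smoothing_filter(f0, truncate=3.0):
    """Build a 2-D Gaussian smoothing kernel whose width follows the band centre
    frequency f0 (normalised, 0 < |f0| <= 0.5), scaled to unity gain."""
    if not 0.0 < abs(f0) <= 0.5:
        raise ValueError("f0 must satisfy 0 < |f0| <= 0.5")
    sigma = 1.0 / (2.0 * math.sqrt(2.0) * abs(f0))
    # Assumption: cut the infinite Gaussian support off at `truncate` sigmas.
    radius = max(1, int(math.ceil(truncate * sigma)))
    offsets = range(-radius, radius + 1)
    kernel = [[math.exp(-(k * k + l * l) / (2.0 * sigma ** 2)) for l in offsets]
              for k in offsets]
    total = sum(sum(row) for row in kernel)
    # Dividing by the sum gives unity gain, so the mean of the output
    # equals the mean of the input.
    return [[v / total for v in row] for row in kernel], sigma
```

A lower centre frequency yields a larger kernel: for example, `f0 = 0.125` produces a wider region of support than `f0 = 0.25`.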

For dimension reduction and extraction of saliency, principal component ana- 
lysis is performed on the raw wavelet features, v(m,n). The final feature space 
for the texture segmentation task consists of two dimensions of textural features 
(the top two principal components, which typically contribute more than 85% 
of the total variances of the wavelet features) and one dimension of luminance. 

Figure 8 shows some texture segmentation results. Again, as in colour
segmentation, mean shift clustering with kernel radius h = 0.7 is used to determine
the number of classes. For the ‘brodatz’ image, the algorithm is able to
distinguish all 5 textures of the Brodatz texture mosaic and produces a highly accurate
segmentation map. The segmentation of the SAR image, ‘sar’, shows remarkable
preservation of details as well as accurate boundary detection. The image
‘manassas’, an aerial view of the city of Manassas, Virginia, provides an
interesting challenge to the algorithm, which, as shown, is able to successfully isolate
densely populated areas from roads and flat plains. The leopard of the image
‘leo’ is also successfully segmented from background grass and scrub; the
‘misclassified’ regions are shadows and relatively large homogeneous regions of
black spots on the legs.




Fig. 8. First row: The ‘brodatz’ and ‘sar’ image and their corresponding segmentations. 
Second row: The ‘manassas’ and ‘leo’ image and their corresponding segmentations 



6 Summary and Discussion 

In this paper, we have proposed a general multiscale approach for unsupervised 
image segmentation. The method is general due to its independence of the fea- 
ture extraction process and unsupervised in that the number of classes is not 
known a priori. The algorithm is also highly flexible due to its ability to control 
segmentation sensitivity and robust through the use of the mean shift procedure 
and multiscale processing. 

The mean shift procedure has been proven to perform well in detecting clu- 
sters of complicated feature spaces of many real images. By controlling the kernel 
size, the procedure is capable of producing classes whose associative properties 
correspond well to a meaningful partitioning of an image. The Multiscale Ran- 
dom Field model makes effective use of the inherent trade-off between class and 
position uncertainty which is evident through the excellent boundary tracking 
performance. This multiscale processing reduces computational costs by keeping 
computations local and yet produces results that reflect the global properties of 
the image. 

The proposed method has been shown to perform well for colour and texture 
segmentation of various images. It produces desirable segmentations with smooth 
regions of homogeneous behaviour and accurate boundaries. We believe these
segmentations possess a high degree of utility, especially as precursors to
higher-level tasks of scene analysis or object recognition.

Abstract. In this paper we introduce a non-parametric clustering algorithm
for 1-dimensional data. The procedure looks for the simplest (i.e.
smoothest) density that is still compatible with the data. Compatibility 
is given a precise meaning in terms of the Kolmogorov-Smirnov statistic. 
After discussing experimental results for colour segmentation, we outline 
how this proposed algorithm can be extended to higher dimensions. 



1 Motivation and Overview 



The quest for robust and autonomous image segmentation has rekindled the inte- 
rest of the computer vision community in the generic problem of data clustering 
(see e.g. |3lbll5lll2llb| ). The underlying rationale is rather straightforward: As 
segmentation algorithms try to divide the image into regions that are fairly ho- 
mogeneous, it stands to reason to map the pixels into various feature-spaces (such 
as colour- or texture-spaces) and look for clusters. Indeed, if in some feature- 
space pixels are lumped together, this obviously means that, with respect to 
these features, the pixels are similar. By the same token, image regions that are 
perceptually salient will map to clusters that (in at least some feature-spaces) 
are clearly segregated from the bulk of the data. 

Unfortunately, the clustering problems encountered in segmentation applica- 
tions are particularly challenging, as neither the number of clusters, nor their 
shape is known in advance. Moreover, clusters are frequently unbalanced (i.e. 
have widely different sizes) and often distinctly non-Gaussian (e.g. skewed). This 
heralds serious difficulties for most “classical” clustering algorithms that often 
assume that the number of clusters is known in advance (e.g. K-means), or even 
that the shape of the data-density is explicitly specified up to a small number of
parameters that can be estimated from the data (e.g. Gaussian Mixture Models,
GMM).

Furthermore, strategies to estimate the number of clusters prior to, or concurrent
with, the actual clustering are of limited value as they tend to be biased
towards solutions that favour spherical or elliptical clusters of roughly the same 
size. The root for this bias is to be found in the fact that almost all cluster- 
validity criteria compare variation within to variation between clusters (for more 
details we refer to standard texts such as |9lllll(H 'l. 

To circumvent the problems outlined above, we focus on clustering based 
on non-parametric density estimation (for prior work, see e.g. 1 6) ) . In 
contradistinction to parametric density estimation (such as GMM), no explicit
parametric form of the density is put forward, and the data-density is obtained
by convolving the dataset with a density-kernel. More precisely, given a
d-dimensional dataset $\{x_i \in \mathbb{R}^d;\ i = 1 \ldots n\}$, a density $f(x)$ is obtained by
convolving the dataset with a unimodal density-kernel $K_\sigma(x)$:

$$f(x) = \frac{1}{n} \sum_{i=1}^{n} K_\sigma(x - x_i) \tag{1}$$



where $\sigma$ is the size-parameter for the kernel, measuring its spread. Although
almost any unimodal density will do, one typically takes $K_\sigma$ to be a
(rotation-invariant) Gaussian density with $\sigma^2$ specifying its variance:

$$K_\sigma(x) = \left(\frac{1}{2\pi\sigma^2}\right)^{d/2} e^{-\|x\|^2/2\sigma^2} \tag{2}$$



After convolution we identify clusters by using gradient ascent (hill-climbing) to 
pinpoint local maxima of the density $f$. This procedure ends up assigning each 
point to a nearby density-maximum, thus carving up the data-set in compact 
and dense clumps. 
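
To make the kernel-density and hill-climbing step concrete, here is a small 1-dimensional sketch. It is not the authors' implementation: the hill-climbing is done with the mean-shift fixed-point update, which for a Gaussian kernel is a gradient-ascent step with an adaptive step size, and the function names, tolerances and merging rule are assumptions chosen for the example.

```python
import math


def kde(x, data, sigma):
    """Gaussian kernel density estimate at x for 1-dimensional data."""
    norm = 1.0 / (len(data) * math.sqrt(2.0 * math.pi) * sigma)
    return norm * sum(math.exp(-(x - xi) ** 2 / (2.0 * sigma ** 2)) for xi in data)


def climb(x, data, sigma, tol=1e-6, max_iter=500):
    """Move x uphill to a nearby local maximum of the density estimate."""
    for _ in range(max_iter):
        w = [math.exp(-(x - xi) ** 2 / (2.0 * sigma ** 2)) for xi in data]
        x_new = sum(wi * xi for wi, xi in zip(w, data)) / sum(w)
        if abs(x_new - x) < tol:
            return x_new
        x = x_new
    return x


def cluster(data, sigma, merge_tol=1e-3):
    """Assign every point to the density maximum it climbs to."""
    modes, labels = [], []
    for x in data:
        m = climb(x, data, sigma)
        for j, mj in enumerate(modes):
            if abs(m - mj) < merge_tol:   # same maximum, up to tolerance
                labels.append(j)
                break
        else:
            modes.append(m)
            labels.append(len(modes) - 1)
    return modes, labels
```

Running `cluster([0.0, 0.1, 0.2, 5.0, 5.1, 5.2], sigma=0.5)` separates the two clumps, whereas a much larger width such as `sigma=5.0` merges them into a single cluster, which illustrates how strongly the outcome depends on the kernel size.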

However, it is obvious that unless the width $\sigma$ is judiciously picked within a
fairly narrow range, this procedure will result in either too many (if $\sigma$ is chosen
too small) or too few clusters (if $\sigma$ is set too large). Although a huge bulk of
the work on density-estimation concerns itself with this problem of choosing an
“optimal” value for $\sigma$ (e.g. see the book by Thompson and Tapia EHI), it is fair
to say that it remains extremely tricky to try and estimate optimal (or even 
acceptable) clustering parameters. 

For this reason we propose a different approach: We start from a sub-optimal
(too small) choice for $\sigma$, and then modify the resulting density $f$ directly. The
proposed modification (which will be detailed in Section 3) is based on the
Kolmogorov-Smirnov statistic and the resulting criterion therefore has a precise
and easy to grasp meaning, which does not involve arbitrarily chosen parameters.

The rest of this paper is organised as follows. In Section 2 we will argue
that performance of clustering is improved if the dimensionality of the problem
can be meaningfully reduced. Rather than trying to combine all the information
in one huge feature-vector, we will champion the view that it makes sense to
look at as simple a feature as reasonable. This amounts to projecting the
high-dimensional data-set on low-dimensional subspaces and is therefore similar in
spirit to Projection Pursuit, a technique used in data analysis, where projections
on low-dimensional subspaces (1- or 2-dimensional) are used to gain insight into
the structure of high-dimensional data.

One particularly interesting and useful case of the aforementioned dimension
reduction is that of clustering one-dimensional data, which boils down to
partitioning the corresponding histogram. This topic is discussed extensively in
Section 3 for several reasons. First, although one can argue that this is just a
special case of the general n-dimensional clustering problem, the topology of a
1-dimensional (non-compact) space (such as $\mathbb{R}$) is unique in that it allows a
total order. As a consequence, the mathematical theory is well understood and
yields sharp results. Furthermore, the 1-dimensional case furnishes us with a
useful stepping stone towards the more complex high-dimensional case that will
be discussed later in the paper. Finally, Section 4 will report on results obtained for
colour segmentation.

2 High-Dimensional Versus Low-Dimensional Clustering 

Like most statistical procedures, clustering in high-dimensional spaces suffers 
from the dreaded curse of dimensionality. This is true in particular, for density 
estimation, as even for large data sets, high-dimensional space is relatively empty. 

As a consequence the reliability and interpretability of the resulting clustering
may be improved whenever it is possible to reduce the dimensionality of
the problem. In particular, this argument indicates that it is often ill-advised 
to artificially increase the dimensionality of the problem by blindly concatena- 
ting feature-vectors into high-dimensional datapoints. More precisely, if there 
is no theoretical or prior indication that features are mutually dependent, it 
is advisable to cluster them separately. The reason for this is straightforward: 
if features $x_1, x_2, \ldots, x_n$ are independent, then their joint probability density
function factorizes into a product of 1-dimensional densities:

$$f(x_1, x_2, \ldots, x_n) = f_1(x_1)\,f_2(x_2) \cdots f_n(x_n), \tag{3}$$

and interesting structure in the joint density $f$ will also be apparent in (one
of) the marginal densities $f_i$. For instance, computing the mean and variance of
the gray-values in a small window about every pixel produces two features at 
each pixel. However, for an unconstrained image there is no reason why these 
two features would be dependent. Therefore, it makes sense to cluster them 
separately, rather than confounding the problem by focussing exclusively on 
their joint distribution. 

In particular, there are a number of perceptually relevant dichotomies (e.g. 
dark versus bright, horizontal versus vertical, direction versus randomness, colou- 
red versus gray, textured versus flat, etc.) that can be captured mathematically in 
a relatively straightforward fashion, but that nevertheless yield important clues 
for segmentation. This means that it makes sense to start studying 1-dimensional 
densities (simple histograms) and this will be our main point of focus for most 
of this paper.

Indeed, one of the motivations for this work is the observation that lots of
effort in computer learning and artificial intelligence focuses on ways of finding 
transformations (often non-linear ones) that vastly reduce the dimensionality of 
the problem. The assumption is that in many cases there is a relatively small 
set of so-called latent variables that capture the intrinsic structure of the pro- 
blem and by determining the intrinsic dimensionality of the data, these (hidden) 
variables are brought to the fore. Exponents of this approach are classical me- 
thodologies such as principal component analysis (PCA) and multi-dimensional 
scaling, but also more recent developments of similar flavour such as projection 
pursuit (PP), generative topographic mapping (GTM), Kohonen’s self-organising
maps (SOM) and independent component analysis (ICA). The latter is actually
looking for transformations that decouple different components such that the
factorisation in eq. (3) is, at least approximately, realised.

3 Histogram Segmentation and 1-Dimensional Clustering 

3.1 The Empirical Distribution Function 

In this section we will concentrate on finding clusters in a sample $x_1, \ldots, x_n$ of
1-dimensional data. In principle, clustering 1-dimensional data by segmenting 
the histogram should be fairly straightforward: all we need to do is locate the 
peaks (local maxima) and valleys (local minima) of the data density (for which 
the histogram is an estimator) and position the cluster boundaries at the local 
minima. However, the problem is that the number and position of these local 
minima will strongly depend on the width of the histogram bins. An appropriate 
choice for this parameter is difficult to make. 

For this reason we have decided to use the cumulative density function (also 
called the distribution function) as the tool of choice for segmentation, since it 
allows a non-parametric approach (see below). We recall that for a stochastic 
variable $X$ with density function $f$, the cumulative density (distribution) $F$ is
defined in terms of the probability $P$ by

$$F(x) = P(X \le x) = \int_{-\infty}^{x} f(u)\,du.$$



Of course, in most cases of interest the underlying density $f$ is unknown and we
proceed by using the empirical distribution $F_n$, which for a sample $X_1, \ldots, X_n$
is given by

$$F_n(x) = \frac{\#\{i : X_i \le x\}}{n} \tag{4}$$

One can prove (see e.g. m) that $F_n$ is an adequate estimator of $F$, as for
instance

$$F_n(x) \to F(x) \quad \text{as } n \to \infty$$

at every continuity point $x$ of $F$.
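
The empirical distribution is straightforward to compute from a sorted copy of the sample. The sketch below is one possible way of doing so; the helper name `make_ecdf` is an assumption of this example.

```python
from bisect import bisect_right


def make_ecdf(sample):
    """Return the empirical distribution function F_n of a 1-dimensional sample."""
    xs = sorted(sample)
    n = len(xs)

    def F_n(x):
        # Number of sample values <= x, divided by the sample size.
        return bisect_right(xs, x) / n

    return F_n
```

Note that no bin width or other tuning parameter enters: for the sample `[3, 1, 2, 2]` the function jumps by 1/4 at 1, by 2/4 at 2 and by 1/4 at 3.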

Compared to the histogram, the empirical distribution has a number of
advantages. First, it is parameter-free as it is completely determined by the data
itself and there is no need to judiciously pick values for critical parameters such
as bin-width. Second, working with the cumulative density rather than with
the density itself has the added benefit of stability. Indeed, the integration
operation which transforms $f$ into $F$ smooths out random fluctuations, thereby
highlighting the more essential characteristics. And last but not least, using
the distribution allows us to invoke the Kolmogorov-Smirnov statistic, a
powerful non-parametric test that can be used to compare arbitrary densities. This
theme will be elaborated further in the next section.



3.2 Non-parametric Density Estimation Using Kolmogorov-Smirnov 

To make good on our promise to proceed in a non-parametric fashion, we ask
ourselves the question: What is the smoothest density $g$ that is compatible
with the data, in the sense that the corresponding cumulative distribution $G$
is not significantly different from the empirical distribution $F_n$? This is basically
a reformulation of Occam’s razor and in that sense akin to the MDL-principle
that has made several appearances in this context. To tackle this question we
note that, recast in the appropriate mathematical parlance, it reads as follows
(see Fig. 1): Find the density $g$ that solves the following constrained minimisation
problem:

$$\text{minimize } \Phi(g) = \int_{\mathbb{R}} (g'(x))^2\,dx, \quad \text{subject to } \sup_{x \in \mathbb{R}} |G(x) - F_n(x)| \le \epsilon_n, \tag{5}$$

where $\epsilon_n$ is the critical value for the Kolmogorov-Smirnov statistic at an
appropriate significance level, e.g. 5% (details regarding the Kolmogorov-Smirnov
statistic can be found in Section 3.3).

As there is no straightforward closed form solution to this problem, we
proceed by invoking a gradient descent procedure,

$$\frac{\partial g}{\partial t} = -\nabla\Phi(g), \tag{6}$$

but this calls for a precise definition of the gradient of a functional. This concept
is studied extensively in functional analysis and we briefly remind the reader of the
relevant definition (for more details, see e.g. Troutman [19], p. 44). To motivate
the approach we recall that in classical calculus, the rate of change of a function
in a specified direction is obtained by taking the inner product of the gradient
and the unit vector in the specified direction. Exactly the same procedure can
be used for functionals: The standard inner product on function spaces is given
by

$$\langle f, g \rangle = \int_{\mathbb{R}} f(x)\,g(x)\,dx$$

and the functional equivalent of a directional derivative is provided by the
important concept of the Gateaux derivative of $\Phi$ at $g$ in the direction of $v$:

$$D_v\Phi(g) := \lim_{\epsilon \to 0} \frac{\Phi(g + \epsilon v) - \Phi(g)}{\epsilon} = \left.\frac{d}{d\epsilon}\,\Phi(g + \epsilon v)\right|_{\epsilon = 0} \tag{7}$$



Under quite mild regularity conditions one can prove that for each $g$ there is a
unique function $w_g$ such that for all $v$, $D_v\Phi(g) = \langle w_g, v \rangle$. This function is
called the gradient of $\Phi$ at $g$ and denoted by $D\Phi(g)$, resulting in the suggestive
formula

$$D_v\Phi(g) = \langle D\Phi(g), v \rangle \quad \text{(for all } v\text{)} \tag{8}$$

which is formally identical to the corresponding formula in standard vector
calculus relating the gradient to an arbitrary directional derivative.

It is now straightforward to compute the gradient for the functional in (5).
Plugging the explicit form of the functional $\Phi$ into eq. (7) yields:

$$D_v\Phi(g) = \lim_{\epsilon \to 0} \frac{1}{\epsilon} \int_{\mathbb{R}} \left[(g' + \epsilon v')^2 - g'^2\right] dx = \lim_{\epsilon \to 0} \frac{1}{\epsilon} \int_{\mathbb{R}} \left[2\epsilon\,g'v' + \epsilon^2 v'^2\right] dx = 2 \int_{\mathbb{R}} g'v'\,dx$$



Next, using integration by parts and the assumption that the density function
$g$ and its derivatives vanish at infinity (a reasonable assumption for a density
modelling a histogram), it immediately follows that

$$D_v\Phi(g) = -2\,\langle g'', v \rangle$$

whence

$$D\Phi(g) = -2\,\frac{d^2 g}{dx^2}.$$

Therefore the gradient-descent method for the functional $\Phi$ gives rise to the heat
equation:

$$\frac{\partial g}{\partial t} = c\,\frac{\partial^2 g}{\partial x^2} \quad (c \text{ appropriate conductivity coefficient}) \tag{9}$$



which suggests the following strategy to search for a minimum in eq. (5): Take
an initial (fine-grained) estimate $g = g_0$ for the density, e.g. by
constructing a histogram with small bins, or using a kernel estimator (as in (1))
with $\sigma$ sufficiently small. Next, evolve $g$ according to the diffusion equation
(9) with $g_0$ as initial condition. After each diffusion step, compute the cumulative
density

$$G(x) = \int_{-\infty}^{x} g(u)\,du$$

by (numerically) integrating $g$. Now stop the diffusion the moment the constraint
in (5) is violated and use the final $g$ as the estimate for the density for which
valleys and peaks can be determined.
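
The strategy just described can be sketched in a few lines. The following is a simplified illustration rather than the authors' code: it starts from a fine histogram, applies explicit finite-difference heat-equation steps with zero-flux boundaries, checks the Kolmogorov-Smirnov distance only at the bin edges, and returns the last density that still satisfies the constraint. All function names, the step parameter `r` and these discretisation choices are assumptions; the critical value `eps` is supplied by the caller.

```python
def histogram(sample, bins):
    """Fine-grained initial density g0 and empirical distribution on the bin grid."""
    lo, hi = min(sample), max(sample)
    if hi <= lo:
        raise ValueError("sample must contain at least two distinct values")
    dx = (hi - lo) / bins
    counts = [0] * bins
    for x in sample:
        counts[min(int((x - lo) / dx), bins - 1)] += 1
    n = len(sample)
    g0 = [c / (n * dx) for c in counts]
    F_n, acc = [], 0
    for c in counts:
        acc += c
        F_n.append(acc / n)          # empirical distribution at each right bin edge
    return g0, F_n, dx


def ks_distance(g, F_n, dx):
    """Largest |G - F_n| over the bin edges, with G the running integral of g."""
    G, worst = 0.0, 0.0
    for gi, Fi in zip(g, F_n):
        G += gi * dx
        worst = max(worst, abs(G - Fi))
    return worst


def diffuse_once(g, r):
    """One explicit heat-equation step; r = c*dt/dx^2 should not exceed 0.5."""
    last = len(g) - 1
    out = []
    for i, gi in enumerate(g):
        left = g[i - 1] if i > 0 else gi       # zero-flux boundary (assumption)
        right = g[i + 1] if i < last else gi
        out.append(gi + r * (left - 2.0 * gi + right))
    return out


def roughness(g, dx):
    """Discretised version of the functional: the integral of (g')^2."""
    return sum((b - a) ** 2 for a, b in zip(g, g[1:])) / dx


def smooth_density(sample, eps, bins=64, r=0.25, max_steps=10000):
    """Diffuse the histogram until the next step would leave the KS band."""
    g, F_n, dx = histogram(sample, bins)
    for _ in range(max_steps):
        trial = diffuse_once(g, r)
        if ks_distance(trial, F_n, dx) > eps:
            break
        g = trial
    return g, dx
```

For a sample made of two well-separated clumps, the smoothed density keeps a clear valley between them, because filling the valley would push the cumulative density outside the band.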

Although this approach has been implemented and yields very satisfactory
results, we hasten to point out that there is no guarantee that the evolution
equation (9) actually ends up at a minimum (even a local one). The reason for
this is that although the functional is quadratic, the diffusion is stopped as soon
as it hits the constraint (i.e. the domain boundary it specifies). In most cases it
will be possible to further reduce the functional $\Phi$ by sliding along the constraint.

In fact, one obvious way of doing this would be to make the diffusion
coefficient $c$ in eq. (9) dependent on the Kolmogorov-Smirnov difference:

$$\rho(x) = |G(x) - F_n(x)|$$

yielding a non-linear diffusion:

$$\ldots \qquad (0 < \rho < \epsilon_n). \tag{10}$$

The conductivity coefficient $c$ is engineered to behave like a Gaussian function
near the origin, but to drop smoothly to zero when the difference $\rho$
approaches the critical distance $\epsilon_n$. This ensures that the diffusion is stopped wherever
the smoothed density is about to violate the constraint, whereas it can
proceed unhampered in locations where the Kolmogorov-Smirnov difference is still
sufficiently small. In the actual implementation we used an even simpler
computational scheme to guarantee the same effect: whenever the evolving distribution
hits the KS-boundary, the conductance coefficient $c$ in the region sandwiched
between the two flanking minima was set to zero. This halts the smoothing in
that region, but allows further reduction in complexity at other locations.

The sole drawback is that the diffusion tends to displace minima, so that for
an accurate location it might be worthwhile to locally refit. Alternatively, one
can simply pick the location of the actual minimal value (of the original data) in
a small neighbourhood of the suggested minimum or trace it back to the original
data.

3.3 Confidence Band Based on Kolmogorov-Smirnov Statistic

To implement the rationale underlying eq. (5) and amplified in the preceding
section we still need to specify a principled way to determine the amount of
acceptable deviation $|G(x) - F_n(x)|$. To this end we introduce the
Kolmogorov-Smirnov statistic which directly compares distribution functions (e.g. see ini).
More precisely, if $F_n(x)$ is the cumulative distribution for a sample of size $n$
drawn from $F$, the Kolmogorov-Smirnov test-statistic is defined to be the
$L^\infty$-distance between the two functions:

$$D_n = \sup_{x \in \mathbb{R}} |F_n(x) - F(x)| \tag{11}$$

for which the p-value can be computed using:

$$P(D_n > \zeta) = Q_{KS}(\sqrt{n}\,\zeta), \tag{12}$$

Fig. 1. Segmenting densities (histograms) using the cumulative density. Left: The
empirical cumulative density $F_n$ flanked by its Kolmogorov-Smirnov confidence bands
$F_n \pm \epsilon_n$, together with the smoothed cumulative density $G$ that fits within the band.
Right: The corresponding densities (obtained by differentiation).



where

$$Q_{KS}(\lambda) = 2 \sum_{k=1}^{\infty} (-1)^{k-1} e^{-2k^2\lambda^2}.$$

(A reference can be found in Mood et al. EH). However, the alternating
character makes this series expansion rather unwieldy to use, and we therefore
hark back to Good [B| who proved the following approximation. First, define the
one-sided differences

$$D_n^+ = \sup_x \,(F_n(x) - F(x)) \quad \text{and} \quad D_n^- = \sup_x \,(F(x) - F_n(x)),$$

then Good showed that under the null-hypothesis (i.e. if $F_n$ does indeed
correspond to a sample taken from the underlying distribution $F$), both statistics
$D_n^+$ and $D_n^-$ are identically distributed and tend to the following asymptotic
distribution (for $n$ sufficiently large):

$$4n\,(D_n^+)^2 \sim \chi^2_2 \tag{13}$$

This approximation is eminently useful as it provides us with a handle to
compute the boundary in eq. (5). More precisely, we pick $\epsilon_n$ so that under the
null-hypothesis, it is unlikely that the KS-distance exceeds $\epsilon_n$:

$$P(D_n^+ > \epsilon_n) = \alpha \quad \text{where e.g. } \alpha = 0.05 \text{ or } 0.1. \tag{14}$$

Selecting a critical point $c_\alpha$ for the $\chi^2_2$-distribution such that $P(\chi^2_2 > c_\alpha) = \alpha$,
we see that the probability in eq. (14) can be rewritten as $P(4n\,(D_n^+)^2 > 4n\,\epsilon_n^2) = \alpha$,
whence $\epsilon_n^2 = c_\alpha / 4n$. We therefore conclude that the bound in eq. (5) is
determined by

$$\epsilon_n = \frac{1}{2}\sqrt{\frac{c_\alpha}{n}} \quad \text{where } P(\chi^2_2 > c_\alpha) = \alpha. \tag{15}$$

The only point that needs further amplification concerns the fact that we are
interested in statistics on the two-sided distance $D$, whereas eq. (15) yields bounds
on the one-sided distances $D^+$ or $D^-$. However, since

$$P(D > \zeta) = P\big((D^+ > \zeta) \text{ or } (D^- > \zeta)\big) \le P(D^+ > \zeta) + P(D^- > \zeta) = 2\,P(D^+ > \zeta),$$

we see that we get a (conservative) confidence bound if we set $\epsilon_n$ in eq. (5)
to be equal to

$$\epsilon_n = \frac{1}{2}\sqrt{\frac{c_{\alpha/2}}{n}} \quad \text{where } P(\chi^2_2 > c_{\alpha/2}) = \alpha/2. \tag{16}$$
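
Since a chi-square variable with two degrees of freedom has the survival function exp(-c/2), the critical point has a closed form and the band width is easy to evaluate. The sketch below computes the one-sided and the conservative two-sided bound; the function names are assumptions of this example.

```python
import math


def chi2_2_critical(alpha):
    """Critical point c with P(chi^2_2 > c) = alpha, using P(chi^2_2 > c) = exp(-c/2)."""
    return -2.0 * math.log(alpha)


def ks_band(n, alpha=0.05, two_sided=True):
    """Half-width of the Kolmogorov-Smirnov band for a sample of size n.

    two_sided=True applies the conservative correction alpha -> alpha/2."""
    a = alpha / 2.0 if two_sided else alpha
    return 0.5 * math.sqrt(chi2_2_critical(a) / n)
```

For n = 100 and a 5% level this gives a band of roughly 0.122 (one-sided) and 0.136 (two-sided); the latter agrees with the familiar large-sample rule of thumb of about 1.36 divided by the square root of n.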

3.4 Comparison to Fitting Gaussian Mixture Models 

Fitting a Gaussian Mixture Model (GMM) is probably the most popular method
to partition a histogram into an unknown number of groups. If the number
of clusters is known in advance, one can take recourse to the well-known
Expectation-Maximisation algorithm (EM) 0 to estimate the corresponding
parameters (i.e. mean, variance and prior probabilities of each group). However,
caution is called for as the sensitivity of the EM-algorithm to its initialisation is
well-documented: Initially assigning a small number of “outliers” to the wrong
group (albeit with small probability) often lures the algorithm to an erroneous
local likelihood maximum, from which it never recovers.

The second problem has to do with the fact that the number of groups
isn’t known in advance and needs to be determined on the fly. Obviously, maximum
likelihood methods are unable to extract the number of clusters as the
likelihood increases monotonically with the number of clusters. One possibility,
proposed by Carson et al. [2], is to use a criterion based on Minimum Description
Length (MDL). The idea is to combine the likelihood of the data with respect
to a (Gaussian mixture) model with a penalty term that grows with the number
of parameters that need to be determined to fit the model. More precisely, for
a sample $x$ of size $n$ they choose the number $K$ of components in the Gaussian
mixture (determined by parameters $\theta$) by maximising

$$L(\theta \mid x) - \beta\,\frac{m_K}{2}\log n \tag{17}$$

where $m_K$ is the number of free parameters needed for a model with $K$ Gaussian
(d-dimensional) mixture components:

$$m_K = (K-1) + Kd + K\,\frac{d(d+1)}{2}$$

(The significance of the $\beta$-factor will be discussed presently).

There are two, potentially serious, problems. First, there are the aforementioned
problems regarding the instabilities inherent to the EM-algorithm. But
even if the EM-algorithm is successful in identifying the underlying mixture,
there is the need for an ad hoc factor $\beta$ to balance out the contribution from
both cost-terms in eq. (17), as they may differ by an order of magnitude.

One could of course object that the fudge-factor $\beta$ is comparable to the
parameter $\alpha$ that needs to be fixed in the KS-approach. But there is an important
difference: unlike $\beta$, the factor $\alpha$ specifying the confidence level has a clear and
operational meaning in terms of the risk of committing a type-I error and this 
risk needs to be fixed in any statistical approach to data-analysis. 

In all fairness we need to point out that there is one situation in which the 
EM-algorithm yields a more satisfying result than the non-parametric approach. 
Whenever we have two Gaussian densities that encroach on one another, there 
is a possibility that the global density shows two ill-separated bumps without 
a clearcut minimum. In such cases EM has little difficulty extracting the indi- 
vidual Gaussians (granted of course, that the number of Gaussians is specified 
beforehand). As there is no minimum in the original density, our method will 
have no alternative but to lump the Gaussians together in one cluster. 

Having said that, it is also worthwhile to point out that there are situations 
where EM will fail to deliver the goods while the non-parametric approach has 
no difficulty whatsoever. The simplest example is a uniformly distributed density.
In an attempt to come up with a good approximation to this flat density, the
EM-algorithm has no other option but to insert a variable number of Gaussians,
resulting in an excessive fractioning of the cluster.

In conclusion we can say the EM-algorithm for GMM is a typical example 
of a parametric approach to density estimation. As such it enjoys an advantage 
over a non-parametric approach (such as the one detailed in this paper) whene- 
ver there is clear evidence that the underlying data-distribution is well modeled 
by the proposed parametrised density. However, in typical image-segmentation 
problems such an assumption is seldom warranted and consequently, EM is 
almost invariably outperformed by the proposed non-parametric histogram seg- 
mentation. 



4 Some Experimental Results 

We also tested this strategy on a number of challenging colour images (see
Fig. 2). In keeping with the spirit of our approach we project each image on the
axes of a number of different colour-spaces (such as RGB, rgb, and opponent
colours). This yields for each image 9 histograms, which are all segmented. The
resulting histogram clusterings can easily be scored by marking whether there is
more than one cluster (a single cluster being uninteresting) and, if so, how
well-separated and pronounced these clusters are (e.g. by comparing their mean
distance to their variance). In the experiments reported below we display for each
image the two most salient histograms. More precisely, we show the original colour
images (left), together with two histograms obtained by projection on an
appropriate colour-axis (the choice of which is image dependent) and the resulting
image segmentation based on the segmentation of the histogram. It is clear that
combining the information from the different projections often yields very
acceptable segmentation results.
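
The scoring of a histogram clustering described above can be sketched as follows. This is only one possible reading of "comparing their mean distance to their variance": the function name `separation_score` and the choice of dividing the gap between neighbouring cluster means by the square root of their summed variances are assumptions made for illustration, not the formula used in the experiments.

```python
import numpy as np

def separation_score(values, labels):
    """Score a 1-D clustering (hypothetical helper).

    Returns 0.0 when there is a single cluster (uninteresting); otherwise the
    smallest gap between neighbouring cluster means, divided by the square
    root of the summed variances of the two clusters involved.
    """
    values = np.asarray(values, dtype=float)
    labels = np.asarray(labels)
    ids = np.unique(labels)
    if ids.size < 2:
        return 0.0
    means = np.array([values[labels == k].mean() for k in ids])
    variances = np.array([values[labels == k].var() for k in ids])
    order = np.argsort(means)                      # sort clusters along the axis
    means, variances = means[order], variances[order]
    gaps = np.diff(means)                          # distance between neighbours
    spread = np.sqrt(variances[:-1] + variances[1:]) + 1e-12
    return float(np.min(gaps / spread))
```

Ranking the 9 histograms of an image by such a score would be one way to pick the "two most salient" ones.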

To enhance the robustness of the segmentation we apply two simple
preprocessing steps:

1. Slight diffusion of the colours in the original image; apart from reducing noise
it introduces some sort of spatial correlation into the statistics and therefore
compensates for the fact that spatial information is completely lost when
mapping pixels into colour-spaces.

2. Global perturbation of the 1-dimensional data by adding independent Gaussian
noise to all the datapoints:

$$x_i \longrightarrow x_i + \delta_i$$

where the $\delta_i \sim N(0, \sigma^2)$ are independent and the standard deviation $\sigma$ is taken
to be a fraction of the data range $R$:

$$\sigma = \gamma R \quad \text{(typically, } \gamma = 0.01\text{)}.$$

The reason for introducing this perturbation is that it resolves ties and removes
artifacts due to quantisation, thus improving the final results.
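
The second preprocessing step is simple enough to sketch directly. The function name `perturb` and the use of NumPy's random generator are assumptions for illustration; the only ingredients taken from the text are the independent Gaussian noise and a standard deviation equal to a fraction (typically 0.01) of the data range.

```python
import numpy as np

def perturb(x, gamma=0.01, rng=None):
    """Add independent Gaussian noise with standard deviation gamma * range.

    `x` is the 1-D data (e.g. quantised colour values projected on one axis).
    The noise breaks ties between identical quantised values.
    """
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x, dtype=float)
    data_range = x.max() - x.min()     # R in the text
    sigma = gamma * data_range         # standard deviation of the noise
    return x + rng.normal(0.0, sigma, size=x.shape)
```

For 8-bit colour data the range is at most 255, so the default setting moves each value by a few grey levels at most.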

It goes without saying that segmentation based on a single 1-dimensional 
histogram will only reflect a particular visual aspect (if any at all), and as such 
only has a very limited range of applicability. However, we contend that as 
different aspects are highlighted by different histograms, combinations of the 
regions thus obtained will yield complementary information. 

This topic will be taken up in a forthcoming paper but for now, let us just
point out that it is helpful to think of the segmentation results for the
one-dimensional histograms as some sort of spatial binding. If, for some feature,
pixels are mapped into the same region, then they are in effect “bound together” in the
sense that, with respect to that particular feature, they are very similar. In this
way, each different projection (feature) imposes its own binding-structure on the
pixels, and pixels that are often “bound together” in the same region therefore
accrue a lot of mutual spatial correlation. This spatial correlation structure can
be used to improve segmentation or to suggest to the user a number of different
possible segmentations, the correlation structure detailing for each of them their
statistical support.
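
Since the details are deferred to a forthcoming paper, the following is only a guess at what such a binding structure could look like in code. The function `binding_matrix` is hypothetical: it records, for every pair of pixels, the fraction of features (projections) under which the two pixels fall in the same region.

```python
import numpy as np

def binding_matrix(label_maps):
    """Fraction of features under which each pair of pixels shares a region.

    `label_maps` is a list with one integer label array per feature, each
    flattened to the same length N. The result is an N x N matrix, so this
    sketch is only practical for small pixel sets or sub-sampled images.
    """
    stack = np.stack([np.asarray(m).ravel() for m in label_maps])  # (F, N)
    same_region = stack[:, :, None] == stack[:, None, :]           # (F, N, N)
    return same_region.mean(axis=0)
```

Entries close to 1 mark pixel pairs that are bound together by most features; thresholding the matrix at different levels would suggest alternative segmentations with different degrees of support.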

5 Extensions to Higher Dimensions 

The main thrust of the argument in this paper was based on the Kolmogorov-Smirnov
distance, and it is therefore of interest to note that there is a
multi-dimensional extension of sorts for the KS-statistic. This opens up the possibility
to extend this approach to higher dimensions, always bearing in mind of course
that the dimension should not be inflated without proper reason.

The generalisation of the distribution function for a d-dimensional stochastic
variable $\mathbf{X}$ is straightforward:

$$F(\mathbf{x}) := P(\mathbf{X} \le \mathbf{x}) = P(X_1 \le x_1, \ldots, X_d \le x_d) \qquad (18)$$

For the sake of brevity, we limit ourselves here to formulating the relevant
theorem (for more details, see [10]):

For any $\varepsilon > 0$ there exists a sufficiently large $n_0$ such that for $n > n_0$ the
inequality

$$P\Big\{ \sup_{\mathbf{x}} |F(\mathbf{x}) - F_n(\mathbf{x})| > \varepsilon \Big\} < 2 e^{-a \varepsilon^2 n} \qquad (19)$$

holds true, where $a$ is any constant smaller than 2.

Notice how this result falls short of the mathematical solidity and elegance enjoyed
by the 1-dimensional result. First of all, having to deal with an inequality
rather than an equality means that we are only given an upper bound for the
probability. Furthermore, as stated above, the result is awkward to use as it merely
asserts the existence of an appropriate sample size ($n$), given a KS-distance $\varepsilon$.
However, in practice the sample size is fixed in advance and there is little scope
for an asymptotic expansion. In fact, for most realistic sample sizes, the specified
upper bound is larger than 1 and therefore of little use.

These theoretical provisos notwithstanding, there is no good reason why a
strategy similar to the one expounded above for the 1-dimensional case cannot be
explored in higher dimensions, if we are willing to shoulder a higher computational
burden. More specifically, we propose the following algorithm to cluster
d-dimensional data.

Algorithm. Given a sample $\mathbf{x}_1, \ldots, \mathbf{x}_n$ in $\mathbb{R}^d$:

1. Compute for each $\mathbf{x}_i$ the empirical distribution function
$F_n(\mathbf{x}_i) = \#\{\mathbf{x}_k \mid \mathbf{x}_k \le \mathbf{x}_i\}/n$ (the ordering relation is defined
component-wise, as in eq. (18)). Next, pick a small initial value for $\sigma$.

2. Use the kernel estimator to construct the kernel-estimate $f_\sigma$ for the density.
In order to evaluate the KS-statistic we need the corresponding cumulative density
$F_\sigma$, which can be obtained by integration:

$$F_\sigma(\mathbf{x}) = \frac{1}{n} \sum_{i=1}^{n} \int_{-\infty}^{\mathbf{x}} K_\sigma(\boldsymbol{\xi} - \mathbf{x}_i)\, d\boldsymbol{\xi} \qquad (20)$$

If the kernel $K_\sigma$ is a rotation-invariant Gaussian (actually the most
common choice), then its integral can be straightforwardly expressed in terms
of products of the error-function $\mathrm{erf}(x)$, and (20) therefore yields an explicit
expression.

3. Compute the KS-distance between the proposed distribution $F_\sigma$ and the
empirical one supported by the actual data:

$$D(F_\sigma) = \sup_i |F_\sigma(\mathbf{x}_i) - F_n(\mathbf{x}_i)|$$

4. To assess how (un)acceptable this result is we need to compute the p-value
for $D(F_\sigma)$, i.e. we need to compute the probability that a sample from $F_\sigma$ will
yield a value at least as large as $D(F_\sigma)$. To this end we draw $M$ samples of
size $n$ from $F_\sigma$ and construct for each of them the corresponding empirical
distribution $F_n^{(m)}$ ($m = 1, \ldots, M$) and the associated distance $D_m$. Ranking $D(F_\sigma)$
relative to the sequence $\{D_m \mid m = 1, \ldots, M\}$ yields an estimate for the
required p-value. (Note that since $F_\sigma$ is based on a convolution, sampling
from this distribution is straightforward: first pick a data-point $\mathbf{x}_i$ at random
and next, sample from the Gaussian centered at $\mathbf{x}_i$.)

5. Finally, if the p-value thus obtained indicates that there is still room to
further increase $\sigma$ (i.e. to further smooth $f$), do so and return to step 2. Notice
how we can change $\sigma$ globally (which amounts to a global smoothing), or
locally at those locations where the KS-difference indicates that there is further
leeway for data-smoothing. This is the multi-dimensional equivalent of the
non-linear smoothing proposed earlier for the 1-dimensional case.
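
The five steps above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the implementation used for the paper: all function names are hypothetical, the kernel is an isotropic Gaussian with a single global width, the supremum is taken over the sample points only, and the width is grown geometrically (factor `growth`) until the Monte Carlo p-value drops below a level `alpha`.

```python
import math
import numpy as np

_erf = np.vectorize(math.erf)

def empirical_cdf(sample, at):
    """F_n(a) = #{x_k : x_k <= a in every component} / n (step 1)."""
    below = np.all(sample[None, :, :] <= at[:, None, :], axis=2)
    return below.mean(axis=1)

def smoothed_cdf(data, at, sigma):
    """F_sigma(a) for an isotropic Gaussian kernel (step 2): a mean over the
    data points of products of 1-D normal CDFs, i.e. of erf terms."""
    z = (at[:, None, :] - data[None, :, :]) / (sigma * math.sqrt(2.0))
    return (0.5 * (1.0 + _erf(z))).prod(axis=2).mean(axis=1)

def ks_distance(data, sample, sigma):
    """Largest |F_sigma - F_n| over the points of `sample` (step 3)."""
    gap = smoothed_cdf(data, sample, sigma) - empirical_cdf(sample, sample)
    return float(np.max(np.abs(gap)))

def p_value(data, sigma, n_draws=100, rng=None):
    """Monte Carlo p-value of the observed KS distance (step 4)."""
    rng = np.random.default_rng() if rng is None else rng
    n, d = data.shape
    observed = ks_distance(data, data, sigma)
    exceed = 0
    for _ in range(n_draws):
        centres = data[rng.integers(0, n, size=n)]             # random data points
        draw = centres + sigma * rng.standard_normal((n, d))   # plus kernel noise
        exceed += ks_distance(data, draw, sigma) >= observed
    return exceed / n_draws

def smooth_as_much_as_allowed(data, sigma0, alpha=0.05, growth=1.5,
                              max_steps=20, n_draws=100, rng=None):
    """Global version of step 5: enlarge sigma while the data stay compatible."""
    sigma = sigma0
    for _ in range(max_steps):
        if p_value(data, sigma * growth, n_draws, rng) < alpha:
            break
        sigma *= growth
    return sigma
```

The cost is dominated by the pairwise comparisons, which grow quadratically with the sample size for every Monte Carlo draw; this is the higher computational burden mentioned above.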

6 Conclusion and Outlook 

In this paper we have introduced a non-parametric clustering algorithm for
1-dimensional data. The procedure looks for the simplest (i.e. smoothest) density
that is still compatible with the data. Compatibility is given a precise meaning in
terms of the Kolmogorov-Smirnov statistic. This approach is therefore genuinely
nonparametric and does not involve fixing arbitrary cost- or fudge-factors.

We have argued that it often makes sense to look for salient regions by
investigating projections on appropriate 1-dimensional feature-spaces, which are
inspected for evidence of clusters. We note in passing that this provides us with
an operational tool for automatic and data-driven selection of promising features:
a feature is deemed interesting (for the image under scrutiny) whenever it gives
rise to a non-trivial clustering. Finally, we have outlined how the results obtained
in the 1-dimensional case can be generalised to higher-dimensional settings.

Fig. 2. Original colour images (left), together with two histograms obtained by
projection on an appropriate colour-axis (the choice of which is image dependent) and the
resulting image segmentation based on the segmentation of the histogram.
Abstract. We describe a method for computing the likelihood that a
completion joining two contour fragments passes through any given position
and orientation in the image plane, that is, a method for completing
the boundaries of partially occluded objects. Like computations in primary
visual cortex (and unlike all previous models of contour completion
in the human visual system), our computation is Euclidean invariant.
This invariance is achieved in a biologically plausible manner by
representing the input, output, and intermediate states of the computation in
a basis of shiftable-twistable functions. The spatial components of these
functions resemble the receptive fields of simple cells in primary visual
cortex. Shiftable-twistable functions on the space of positions and
directions are a generalization of shiftable-steerable functions on the plane.



1 Introduction 

Any computational model of human visual information processing must reconcile
two apparently contradictory observations. First, computations in primary visual
cortex are largely Euclidean invariant: an arbitrary rotation and translation of
the input pattern of light falling on the retina produces an identical rotation and
translation of the output of the computation. Second, simple calculations based
on the size of primary visual cortex (60 mm × 80 mm) and the observed density
of cortical hypercolumns (4/mm²) suggest that the discrete spatial sampling of
the visual field is exceedingly sparse. The apparent contradiction becomes
clear when we ask the following questions: How is this remarkable invariance
achieved in computations performed by populations of cortical neurons with
broadly tuned receptive fields centered at so few locations? Why doesn’t our
perception of the world change dramatically when we tilt our head by 5 degrees?¹

¹ Ulf Eysel asks a related question in a recent Nature paper:
“On average, a region of just 1 mm² on the surface of the cortex will contain
all possible orientation preferences, and, accordingly, can analyze orientation
for one small area of the visual field. This topographical arrangement allows
closely spaced objects with different orientations to interact. But it also means
that a continuous line across the whole visual field would be cortically depicted
in a patchy, discontinuous fashion. How can the spatially separated elements
be bound together functionally?”

D. Vernon (Ed.): ECCV 2000, LNCS 1843, 2000.

Fig. 1. (a) Ehrenstein Figure (b) Kanizsa Triangle

One of the main goals of our research is to show how the sparse and
seemingly haphazard nature of the sampling of the visual field can be reconciled
with the Euclidean invariance of visual computations. To realize this goal, we
introduce the notion of a shiftable-twistable basis of functions on the space,
$\mathbb{R}^2 \times S^1$, of positions and directions. This notion is a generalization of the notion
of a shiftable-steerable basis of functions on the plane, $\mathbb{R}^2$, introduced by
Freeman, Adelson, Simoncelli, and Heeger in two seminal papers. Freeman and
Adelson clearly appreciated the importance of the issues raised above when
they devised the notion of a steerable basis to implement rotationally invariant
computations. In fact, for computations in the plane the contradictions
discussed above were largely resolved with the introduction by Simoncelli et al.
of the shiftable-steerable pyramid transform, which was specifically designed to
perform Euclidean invariant computations on $\mathbb{R}^2$. The basis functions in the
shiftable-steerable pyramid are very similar to simple cell receptive fields in
primary visual cortex. However, many computations in V1 and V2 likely operate
on functions of the space of positions and directions, $\mathbb{R}^2 \times S^1$, rather than on
functions of the plane, $\mathbb{R}^2$. Consequently, we propose that
shiftability-twistability (in addition to shiftability-steerability) is the
property which binds sparsely distributed receptive fields together functionally
to perform Euclidean invariant computations in visual cortex.

In this article, we describe a new algorithm for completing the boundaries of
partially occluded objects. This algorithm is based on a computational theory of
contour completion in primary and secondary visual cortex developed in recent
years by Williams and colleagues [19,20,21,22]. Like computations in V1 and V2,
and unlike previous models of illusory contour formation in the human visual
system, our computation is Euclidean invariant. This invariance is achieved by
representing the input, output, and intermediate states of the computation in a
basis of shiftable-twistable functions.

Mumford proposed that the probability distribution of natural shapes
can be modeled by particles traveling with constant speed in directions given by
Brownian motions. More recently, Williams and Jacobs defined the stochastic
completion field to be the distribution of particle trajectories joining pairs of
position and direction constraints, and showed how it could be computed in a
neural network.

The neural network described by Williams and Jacobs is based on Mumford’s
observation that the evolution in time of the probability density function (p.d.f.)
representing the position, $(x, y)$, and direction, $\theta$, of the particle can be modeled
as a set of independent advection equations acting in the $(x, y)$ dimensions coupled in
the $\theta$ dimension by the diffusion equation. Unfortunately, solutions of this
Fokker-Planck equation computed by numerical integration on a rectangular
grid do not exhibit the robust invariance under rotations and translations which
characterizes the output of computations performed in primary visual cortex.
Nor does any other existing model of contour completion, sharpening, or saliency.

Our new algorithm computes stochastic completion fields in a Euclidean
invariant manner. Figure 2 (left) is a picture of the stochastic completion field
due to the Kanizsa Triangle stimulus in Figure 1. Figure 2 (right) shows
the stochastic completion field due to a rotation and translation of the (input)
Kanizsa Triangle. The Euclidean invariance of our algorithm can be seen by
observing that the (output) stochastic completion field on the right in Figure 2
is itself a rotation and translation of the stochastic completion field on the left,
by the same amount.

2 Relevant Neuroscience 

Our new Euclidean invariant algorithm was motivated, in part, by the following
experimental findings. To begin with, the receptive fields of simple cells, which
have been traditionally described as edge (or bar) detectors, can be accurately
modeled using two-dimensional Gabor functions, which are the product
of a Gaussian (localized in position) and a harmonic grating (localized in
orientation and spatial frequency). Gabor functions are well suited to the purpose
of encoding visual information, since, by the Heisenberg Uncertainty Principle,
they are the unique functions which are maximally localized in both space and
frequency.
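
As a concrete illustration of the product structure just described, a real-valued Gabor function can be written down directly. The parameter names (`sigma` for the width of the Gaussian envelope, `freq` for the spatial frequency of the grating, `theta` for its orientation, `phase` for its offset) are choices made for this sketch only.

```python
import numpy as np

def gabor(x, y, sigma, freq, theta, phase=0.0):
    """Gaussian envelope (localized in position) times a harmonic grating
    (localized in orientation and spatial frequency)."""
    along = x * np.cos(theta) + y * np.sin(theta)      # coordinate across the bars
    envelope = np.exp(-(x ** 2 + y ** 2) / (2.0 * sigma ** 2))
    grating = np.cos(2.0 * np.pi * freq * along + phase)
    return envelope * grating
```

Evaluating `gabor` on a grid built with `np.meshgrid` gives the familiar striped, bell-shaped profile; a phase of 0 yields an even (bar-like) profile and a phase of π/2 an odd (edge-like) one.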

The sampling of the visual field in V1 is quite sparse: there are about
100 × 100 hypercolumns, with receptive fields of about 5 scales and 16 orientations
in each hypercolumn. Neglecting size (and phase), a simple cell receptive field can
be parameterized by its position and orientation. The spatial distribution of these
two parameters, known as orientation preference structure, is an attempt (on the
part of evolution) to smoothly map the three-dimensional parameter space,
$\mathbb{R}^2 \times S^1$, of edge positions and orientations onto the two-dimensional surface,
$\mathbb{R}^2$, of the visual cortex. Due to the differences in dimensionality, orientation
preference structure is punctuated by so-called pinwheels, which are the
singularities in this mapping.

Fig. 2. Stochastic completion fields: of the Kanizsa Triangle (left) and after the initial
conditions have been rotated and translated (right)

As a first approximation, a neuron’s response to an arbitrary grey-level image
can be modeled as the $L^2$-inner product of the image with the neuron’s
receptive field. These experimental observations suggested to Daugman that an
ensemble of simple cell receptive fields can be regarded as performing a wavelet
transform of the image, in which the responses of the neurons correspond to the
transform coefficients and the receptive fields correspond to the basis functions.

Recent experiments have demonstrated that the response of simple cells in
V1 can be modulated by stimuli outside the classical receptive fields. Apart
from underscoring the limitations of the classical (linear) model, they suggest
a function for the long-range connections which have been observed between
simple cells. For example, in a recent experiment, Gilbert has demonstrated
that a short horizontal bar stimulus can modulate the response of simple cells
whose receptive fields are located at a significant horizontal distance from the
bar, and which have a similar orientation preference to the bar. Non-linear
long-range effects have also been observed in secondary visual cortex. For example,
von der Heydt et al. reported that the firing rate of certain neurons in V2
increases when their “receptive fields” are crossed by illusory contours (of specific
orientations) which are induced by pairs of bars flanking the receptive field.
Significantly, these neurons do not respond to these same bars presented in
isolation; they only respond to pairs.²

² These experiments suggest that the source and sink fields, which are intermediate
representations in Williams and Jacobs’ model of illusory contour formation, could be
represented by populations of simple cells in V1, and that the stochastic completion
field, which is the product of the source and sink fields, could be represented in V2.

Although our new contour completion algorithm does not provide a model for
illusory contour formation in the brain which is realistic in every respect, it does
have several features which are biologically plausible, none of which are found
in previous algorithms. These features are that (1)
all states of the computation are represented in a wavelet-like basis of functions
which are localized in both space and frequency (spatial localization allows the
computation to be performed in parallel); (2) the computation operates on the
coefficients in the wavelet-like transform and can be implemented in a neural
network; (3) the computation is Euclidean invariant; and (4) it is accomplished
using basis functions with centers lying on a (relatively) sparse grid in the image
plane.



3 Shiftable-Twistable Bases

Many visual and image processing tasks are most naturally formulated in the
continuum and are invariant under a group of symmetries of the continuum. The
Euclidean group, of rotations and translations, is one example of a continuous
symmetry group. However, because discrete lattices are not preserved by the
action of continuous symmetry groups, the natural invariance of a computation
can be easily lost when it is performed in a discrete network. In this section we
will introduce the notion of a shiftable-twistable basis and show how it can be
used to implement discrete computations on the continuous space of positions
and directions in a way which preserves their natural invariance.

In image processing, the input and output are functions on $\mathbb{R}^2$, and the
appropriate notion of the invariance of computations is Euclidean invariance:
any rotation and translation of the input should produce an identical rotation
and translation of the output. Simoncelli et al. introduced the notion of a
shiftable-steerable basis of functions on $\mathbb{R}^2$, and showed how it can be used to
achieve Euclidean invariance in discrete computations for image enhancement,
stereo disparity measurement, and scale-space analysis.

Given the nature of simple cell receptive fields, the input and output of
computations in primary visual cortex are more naturally thought of as functions
defined on the continuous space, $\mathbb{R}^2 \times S^1$, of positions, $\mathbf{x} = (x, y)$, in the plane,
$\mathbb{R}^2$, and directions, $\theta$, in the circle, $S^1$. For such computations the appropriate
notion of invariance is determined by those symmetries, $T_{\mathbf{x}_0,\theta_0}$, of $\mathbb{R}^2 \times S^1$, which
perform a shift in $\mathbb{R}^2$ by $\mathbf{x}_0$, followed by a twist in $\mathbb{R}^2 \times S^1$ through an angle,
$\theta_0$. A twist through an angle, $\theta_0$, consists of two parts: (1) a rotation, $R_{\theta_0}$, of $\mathbb{R}^2$
and (2) a translation in $S^1$, both by $\theta_0$. The symmetry, $T_{\mathbf{x}_0,\theta_0}$, which is called a
shift-twist transformation,³ is given by the formula,

$$T_{\mathbf{x}_0,\theta_0}(\mathbf{x}, \theta) = (R_{\theta_0}(\mathbf{x} - \mathbf{x}_0),\; \theta - \theta_0). \qquad (3.1)$$

³ The relationship between shift-twist transformations and computations in V1 was
described by Williams and Jacobs and more recently by Kalitzin et al.
and Cowan [2].

A visual computation on $\mathbb{R}^2 \times S^1$ is called shift-twist invariant if, for all
$(\mathbf{x}_0, \theta_0) \in \mathbb{R}^2 \times S^1$, a shift-twist of the input by $(\mathbf{x}_0, \theta_0)$ produces an identical
shift-twist of the output.
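
A small sketch may help to fix ideas about equation (3.1). The sign convention of the rotation is not spelled out at this point in the text, so the code makes an explicit assumption: the rotation applied to the position argument undoes a counter-clockwise turn by the twist angle, which makes the composed function a copy of the original that has been moved by the shift and turned counter-clockwise by the twist. The names `rotate`, `shift_twist` and `shift_twisted` are hypothetical.

```python
import numpy as np

def rotate(angle, x):
    """Counter-clockwise rotation of the 2-vector x by `angle`."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([c * x[0] - s * x[1], s * x[0] + c * x[1]])

def shift_twist(x0, theta0, x, theta):
    """T_{x0,theta0}(x, theta) = (R(x - x0), theta - theta0).

    Assumption: R is the rotation by -theta0 (see the lead-in)."""
    moved = np.asarray(x, dtype=float) - np.asarray(x0, dtype=float)
    return rotate(-theta0, moved), (theta - theta0) % (2.0 * np.pi)

def shift_twisted(P, x0, theta0):
    """The function (x, theta) -> P(T_{x0,theta0}(x, theta))."""
    return lambda x, theta: P(*shift_twist(x0, theta0, x, theta))
```

A computation `F` acting on such functions is then shift-twist invariant when `F(shift_twisted(P, x0, theta0))` agrees with `shift_twisted(F(P), x0, theta0)` for every choice of shift and twist.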

Correspondingly, we define a shiftable-twistable basis⁴ of functions on $\mathbb{R}^2 \times S^1$
to be a set of functions on $\mathbb{R}^2 \times S^1$ with the property that whenever a function,
$\Psi(\mathbf{x}, \theta)$, is in their span, then so is $\Psi(T_{\mathbf{x}_0,\theta_0}(\mathbf{x}, \theta))$, for every choice of $(\mathbf{x}_0, \theta_0)$ in
$\mathbb{R}^2 \times S^1$. As such, the notion of a shiftable-twistable basis on $\mathbb{R}^2 \times S^1$ generalizes
that of a shiftable-steerable basis on $\mathbb{R}^2$.

Shiftable-twistable bases can be constructed as follows. First we recall
Simoncelli’s concept of the shiftability of a function, which is closely related to the
Shannon-Whittaker Sampling Theorem. A periodic function, $\psi(x)$, of period $X$,
is shiftable if there is an integer, $K$, such that the shift of $\psi$ by an arbitrary
amount, $x_0$, can be expressed as a linear combination of $K$ basic shifts of $\psi$, i.e.,
if there exist interpolation functions, $b_k(x_0)$, such that

$$\psi(x - x_0) = \sum_{k=0}^{K-1} b_k(x_0)\, \psi(x - k\Delta), \qquad (3.2)$$

where $\Delta = X/K$ is the basic shift amount. The simplest shiftable function
in one dimension is a pure harmonic signal, in which case $K = 1$. More
generally, Simoncelli et al. proved that any band-limited function is shiftable.
In fact, if the set of non-zero Fourier series frequencies of $\psi$ is (a subset of)
$B = \{\omega_0, \omega_0 + 1, \ldots, \omega_0 + K - 1\}$, then $\psi$ can be shifted using the $K$ interpolation
functions, $b_k(x_0) = b(x_0 - k\Delta)$, where $b(x)$ is the complex conjugate of the perfect
bandpass filter constructed from the set of $K$ frequencies, $B$. In particular, note
that the interpolation functions only depend on the set of non-zero frequencies
of $\psi$, and not on $\psi$ itself.
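
Equation (3.2) can be checked numerically. The sketch below assumes a particular normalisation that the text does not state: the Fourier series is written with harmonics exp(2πiωx/X), and the perfect bandpass filter is taken to be the average of the K harmonics in B, so that its complex conjugate is b(x) = (1/K) Σ exp(−2πiωx/X). The function names are hypothetical.

```python
import numpy as np

def band_limited(x, coeffs, omega0, period):
    """psi(x) = sum_j coeffs[j] * exp(2*pi*i*(omega0 + j)*x / period)."""
    omegas = omega0 + np.arange(len(coeffs))
    harmonics = np.exp(2j * np.pi * np.outer(np.atleast_1d(x), omegas) / period)
    return harmonics @ np.asarray(coeffs)

def interpolation_functions(x0, K, omega0, period):
    """b_k(x0) = b(x0 - k*Delta) for k = 0..K-1, with
    b(x) = (1/K) * sum over w in B of exp(-2*pi*i*w*x / period)."""
    delta = period / K                       # basic shift amount
    omegas = omega0 + np.arange(K)           # the frequency set B
    offsets = x0 - delta * np.arange(K)
    return np.exp(-2j * np.pi * np.outer(offsets, omegas) / period).sum(axis=1) / K

def shift(x, x0, coeffs, omega0, period):
    """psi(x - x0) assembled from the K basic shifts psi(x - k*Delta)."""
    K = len(coeffs)
    delta = period / K
    b = interpolation_functions(x0, K, omega0, period)
    basic = np.stack([band_limited(x - k * delta, coeffs, omega0, period)
                      for k in range(K)])
    return b @ basic
```

Note that `interpolation_functions` never looks at `coeffs`: the weights depend only on the frequency set, as stated in the text.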

Strictly speaking, since they are not band-limited, functions such as Gabors
are not shiftable. Nevertheless, for all intents and purposes, they can be shifted
by choosing the set, $B$, to consist of all Fourier series frequencies, $\omega$, of $\psi$, such
that the Fourier amplitude, $|\hat{\psi}(\omega)|$, is essentially non-zero (i.e., it exceeds some
small threshold value). Such functions will be called effectively shiftable.

Let $\Psi(\mathbf{x}, \theta)$ be a function on $\mathbb{R}^2 \times S^1$ which is periodic (with period $X$) in
both spatial variables, $\mathbf{x}$. In analogy with the definition of a shiftable-steerable
function on $\mathbb{R}^2$, we say that $\Psi$ is shiftable-twistable on $\mathbb{R}^2 \times S^1$ if there are
integers, $K$ and $M$, and interpolation functions, $b_{\mathbf{k},m}(\mathbf{x}_0, \theta_0)$, such that, for each
$(\mathbf{x}_0, \theta_0) \in \mathbb{R}^2 \times S^1$, the shift-twist of $\Psi$ by $(\mathbf{x}_0, \theta_0)$ is a linear combination of a
finite number of basic shift-twists of $\Psi$ by amounts $(\mathbf{k}\Delta, m\Delta_\theta)$, i.e., if

$$\Psi(T_{\mathbf{x}_0,\theta_0}(\mathbf{x}, \theta)) = \sum_{\mathbf{k},m} b_{\mathbf{k},m}(\mathbf{x}_0, \theta_0)\, \Psi(T_{\mathbf{k}\Delta,\, m\Delta_\theta}(\mathbf{x}, \theta)). \qquad (3.3)$$

Here $\Delta = X/K$ is the basic shift amount and $\Delta_\theta = 2\pi/M$ is the basic twist
amount. The sum in equation (3.3) is taken over all pairs of integers, $\mathbf{k} = (k_x, k_y)$,
in the range, $0 \le k_x, k_y < K$, and all integers, $m$, in the range, $0 \le m < M$.
For many shiftable-twistable bases, the interpolation functions, $b_{\mathbf{k},m}(\mathbf{x}_0, \theta_0)$, are
defined in terms of Simoncelli’s one-dimensional interpolation functions, $b_k(x_0)$.

⁴ We use this terminology even though the basis functions need not be linearly
independent.

The simplest example of a shiftable-twistable basis is the Gaussian-Fourier
basis, which is the product of a shiftable-steerable basis of Gaussians in $\mathbf{x}$
and a Fourier series basis in $\theta$. Let $g(\mathbf{x})$ be a radial Gaussian of
standard deviation, $\nu$. We regard $g$ as a periodic function of period, $X$, which is
chosen to be much larger than $\nu$, so that $g$ defines a smooth periodic function.
For each frequency, $\omega$, we define $G_\omega(\mathbf{x}, \theta) = g(\mathbf{x})\, e^{i\omega\theta}$. Also, given a choice of a
shift amount, $\Delta$, so that $K = X/\Delta$ is an integer, we define the Gaussian-Fourier
basis functions, $G_{\mathbf{k},\omega}$, by

$$G_{\mathbf{k},\omega}(\mathbf{x}, \theta) = g(\mathbf{x} - \mathbf{k}\Delta)\, e^{i\omega\theta}. \qquad (3.4)$$

The following proposition implies that the Gaussian-Fourier basis is
shiftable-twistable.

Proposition 1. The periodic function, $G_\omega$, (of period $X$) is effectively
shiftable-twistable. More precisely, let $M = 1$ and let $K$ be the number of essentially
non-zero Fourier series coefficients of the factor, $g_x(x)$, of $g(\mathbf{x})$.
Then,

$$G_\omega(T_{\mathbf{x}_0,\theta_0}(\mathbf{x}, \theta)) = \sum_{\mathbf{k}} b_{\mathbf{k},\omega}(\mathbf{x}_0, \theta_0)\, G_{\mathbf{k},\omega}(\mathbf{x}, \theta), \qquad (3.5)$$

where the interpolation functions are given by

$$b_{\mathbf{k},\omega}(\mathbf{x}_0, \theta_0) = e^{-i\omega\theta_0}\, b_{\mathbf{k}}(\mathbf{x}_0). \qquad (3.6)$$

Here $b_{\mathbf{k}}(\mathbf{x}_0) = b(\mathbf{x}_0 - \mathbf{k}\Delta)$, where $b(\mathbf{x}_0)$ is the complex conjugate of the perfect
bandpass filter constructed from the set of essentially non-zero Fourier series
coefficients of $g(\mathbf{x})$.

For certain computations, the input can be easily represented in a
Gaussian-Fourier basis. For example, suppose that the input is modeled as a linear
combination of fine scale three-dimensional Gaussians, centered at arbitrary points,
$(\mathbf{x}_0, \theta_0)$, in $\mathbb{R}^2 \times S^1$. Since the input is the product of a Gaussian in $\mathbf{x}$ and a
Gaussian in $\theta$ it can be represented in a single scale Gaussian-Fourier basis as
follows. First, the Gaussian in $\theta$ is represented in the Fourier basis using the
standard analysis and synthesis formulae for Fourier series. Second, if the
two-dimensional input Gaussians in $\mathbf{x}$ are chosen to be shifts of the basis function,
$g(\mathbf{x})$, then we can use Proposition 1 to represent the input Gaussians in $\mathbf{x}$ in the
Gaussian basis.

A somewhat more biologically plausible basis is the complex directional
derivative of Gaussian (CDDG)-Fourier basis, which is very similar to the previous
example, except that the Gaussian, $g(\mathbf{x})$, is replaced by its complex directional
derivative in the direction of the complex valued vector, $[1, i]$. The CDDG looks
more like the receptive field of a simple cell in V1 than a Gaussian does. Also
the CDDG is a wavelet, whereas the Gaussian is not.

4 Stochastic Completion Fields



In their computational theory of illusory contour formation, Williams and
Jacobs argued that, given a prior probability distribution of possible
completion shapes, the visual system computes the local image plane statistics of the
distribution of all possible completions, rather than simply the most probable
completion. This view is in accord with human experience: some illusory
contours are more salient than others, and some appear sharper than others. They
defined the notion of a stochastic completion field to model illusory contours in
a probabilistic manner. The stochastic completion field is a probability density
function (p.d.f.) on the space, $\mathbb{R}^2 \times S^1$, of positions, $\mathbf{x} = (x, y)$, in the plane,
$\mathbb{R}^2$, and directions, $\theta$, in the circle, $S^1$. It is defined in terms of a set of
position and direction constraints representing the beginning and ending points of
a set of contour fragments (called sources and sinks), and a prior probability
distribution of completion shapes, which is modeled as the set of paths followed
by particles traveling with constant speed in directions described by Brownian
motions. The magnitude of the stochastic completion field, $C(\mathbf{x}, \theta)$, is the
probability that a completion from the prior probability distribution will pass
through $(\mathbf{x}, \theta)$ on a path joining two of the contour fragments. Williams and
Jacobs showed that the stochastic completion field could be factored into a
source field and a sink field. The source field, $P'(\mathbf{x}, \theta)$, represents the probability
that a contour beginning at a source will pass through $(\mathbf{x}, \theta)$ and the sink field,
$Q'(\mathbf{x}, \theta)$, represents the probability that a contour beginning at $(\mathbf{x}, \theta)$ will reach
a sink. The completion field is

$$C(\mathbf{x}, \theta) = P'(\mathbf{x}, \theta) \cdot Q'(\mathbf{x}, \theta). \qquad (4.1)$$

The source (or sink) field itself is obtained by integrating a probability density
function, $P(\mathbf{x}, \theta; t)$, over all positive times, $t$, where $P(\mathbf{x}, \theta; t)$ represents the
probability that a particle beginning at a source reaches $(\mathbf{x}, \theta)$ at time $t$:

$$P'(\mathbf{x}, \theta) = \int_0^\infty P(\mathbf{x}, \theta; t)\, dt. \qquad (4.2)$$

Mumford observed that $P$ evolves according to a Fokker-Planck equation of
the form

$$\frac{\partial P}{\partial t} = -\cos\theta\, \frac{\partial P}{\partial x} - \sin\theta\, \frac{\partial P}{\partial y} + \frac{\sigma^2}{2}\, \frac{\partial^2 P}{\partial \theta^2} - \frac{1}{\tau}\, P, \qquad (4.3)$$

where the initial probability distribution of sources (or sinks) is described by
$P(\mathbf{x}, \theta; 0)$. This partial differential equation can be viewed as a set of independent
advection equations in $\mathbf{x} = (x, y)$ (the first and second terms) coupled in the $\theta$
dimension by the diffusion equation (the third term). The advection equations
translate probability mass in direction $\theta$ with unit speed, while the diffusion
term models the Brownian motion in direction, with diffusion parameter, $\sigma$. The
combined effect of these three terms is that particles tend to travel in straight
lines, but over time they drift to the left or right by an amount proportional to
$\sigma^2$. Finally, the effect of the fourth term is that particles decay over time, with
a half life given by the decay constant, $\tau$. This represents our prior expectation
on the length of gaps: most are quite short. In [21] stochastic completion fields
were computed by solving the Fokker-Planck equation using a standard finite
differencing scheme on a regular grid.
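
The particle picture behind the Fokker-Planck equation can be illustrated by a direct simulation. This is a Monte Carlo sketch of the prior over completion shapes, not the finite differencing scheme referred to above and not the algorithm of this paper. The function name, the Euler time stepping and the reading of the decay term as a survival probability exp(−dt/τ) per step are assumptions made for illustration.

```python
import numpy as np

def simulate_particles(n_particles, n_steps, dt, sigma, tau, rng=None):
    """Particles leave the origin in direction 0 with unit speed.

    Their direction performs a Brownian motion with parameter sigma, and each
    particle survives a time step with probability exp(-dt / tau).
    Returns positions, directions and a mask of the particles still alive
    (decayed particles keep moving but are flagged as not alive).
    """
    rng = np.random.default_rng() if rng is None else rng
    x = np.zeros(n_particles)
    y = np.zeros(n_particles)
    theta = np.zeros(n_particles)
    alive = np.ones(n_particles, dtype=bool)
    for _ in range(n_steps):
        x += np.cos(theta) * dt                  # advection in x
        y += np.sin(theta) * dt                  # advection in y
        theta += sigma * np.sqrt(dt) * rng.standard_normal(n_particles)  # diffusion
        alive &= rng.random(n_particles) < np.exp(-dt / tau)             # decay
    return x, y, theta, alive
```

A histogram of the surviving particles over position and direction, accumulated over all time steps, gives a rough estimate of a source field for a single source.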



5 Description of Algorithm 



One of the main goals of this paper is to derive a discrete numerical algorithm
to compute stochastic completion fields in a shift-twist invariant manner. This
invariance is achieved by first evolving the Fokker-Planck equation in a
shiftable-twistable basis of $\mathbb{R}^2 \times S^1$ to obtain representations of the source and sink fields
in the basis, and then multiplying these representations in a shift-twist invariant
manner to obtain a representation of the completion field in a shiftable-twistable
basis.

We observe that a discrete Dirac basis, consisting of functions,
$\Psi_{\mathbf{k},m}(\mathbf{x}, \theta) = \delta(\mathbf{x} - \mathbf{k}\Delta)\, \delta(\theta - m\Delta_\theta)$, where $(\mathbf{k}, m)$ is a triple of integers, is not
shiftable-twistable. This is because a Dirac function located off the grid of Dirac
basis functions is not in their span.

A major shortcoming of previous contour completion algorithms
is that they perform computations in this basis. As a consequence,
initial conditions which do not lie directly on the grid cannot be accurately
represented. This problem is often skirted by researchers in this area by choosing
input patterns which match their choice of sampling rate and phase. For
example, Li [12] used only six orientations (including 0°) and Heitger and von der
Heydt only twelve (including 0°, 60° and 120°). Li’s first test pattern was a
line of orientation, 0°, while Heitger and von der Heydt used a Kanizsa Triangle
with sides of 0°, 60°, and 120° orientation. There is no reason to believe that the
experimental results they show would be the same if their input patterns were
rotated by as little as 5°.⁵

In addition to the problem of representing the input, the computation itself 
must be Euclidean invariant. Stochastic completion fields computed using the 
finite differencing scheme of 1221 exhibit marked anisotropic spatial smoothing 
due to the manner in which 2D advection is performed on a grid (see Figures 141^1 
andEI) . Although probability mass advects perfectly in either of the two principal 
coordinate directions, mass which is moving at an angle to the grid gradually 
disperses, since, at each time step, bilinear interpolation is used to place the 
mass on the grid. 

For reasons of simplicity, in this paper, we chose to compute stochastic com- 



pletion fields in a Gaussian-Fourier basislj The initial conditions for the Fokker- 



® Nor are we blameless in this respect. Williams and Jacobs |21l22j used 36 directions 
(including 0°, 60° and 120°) and demonstrated their computation with a Kanizsa 
Triangle with sides of 0°, 60° and 120° orientation. 

® The computation can also be performed in more biologically plausible shiftable- 
twistable bases, the simplest of which is the CDDG-Fourier basis. 
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Planck initial value problem are modeled by fine scale, three-dimensional Gaus- 
sians, whose centers are determined by the locations and directions of the edge 
fragments to be completed. We use the single scale method discussed in Section 0 
to represent the initial conditions in the basis. 

To solve the Fokker-Planck equation, we express its solution in terms of the
basis functions, $\Psi_{k,\omega}(x,\theta)$, as

$$P(x,\theta;t) = \sum_{k,\omega} c_{k,\omega}(t)\,\Psi_{k,\omega}(x,\theta),$$ (5.1)

where the coefficients, $c_{k,\omega}(t)$, depend on time. Then, we derive a linear
transformation, $c(t+\Delta t) = (A \circ D)\,c(t)$, to evolve the coefficient vector
in time. This transformation is the composition of an advection transformation,
$A$, which has the effect of transporting probability mass in directions $\theta$,
and a diffusion-decay transformation, $D$, which implements both the diffusion of
mass in $\theta$, and the decay of mass over time. Representations of source or
sink fields in the basis are obtained by integrating the coefficient vector,
$c(t)$, over time, where the initial coefficient vector represents the initial
sources or sinks.
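The time stepping and time integration just described can be sketched as follows.
This is only an illustration: the transformations $A$ and $D$ are passed in as
arbitrary callables, the integral over time is approximated by a left Riemann sum,
and the toy operators used in the example (no advection, pure decay) are
assumptions made for the sake of a short runnable example. All function names are
hypothetical.

```python
import math


def integrate_field(c0, advect, diffuse_decay, dt, n_steps):
    """Accumulate the time integral of c(t) under c(t + dt) = A(D(c(t))).

    `advect` and `diffuse_decay` stand in for the transformations A and D;
    here they are arbitrary callables acting on lists of coefficients.
    """
    c = list(c0)
    total = [0.0] * len(c)
    for _ in range(n_steps):
        total = [s + dt * v for s, v in zip(total, c)]  # Riemann sum
        c = advect(diffuse_decay(c))                    # one time step
    return total


def make_decay(tau, dt):
    """Toy D: pure decay, scaling each coefficient by exp(-dt / tau)."""
    factor = math.exp(-dt / tau)
    return lambda c: [factor * v for v in c]


def identity(c):
    """Toy A: no advection."""
    return list(c)


if __name__ == "__main__":
    tau, dt = 4.5, 0.1
    field = integrate_field([1.0, 0.0], identity, make_decay(tau, dt), dt, 2000)
    print(field)
```

In this toy setting the accumulated value of the first coefficient is a geometric
series, which makes the sketch easy to check by hand.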

The shiftability-twistability of the basis functions is used in two distinct ways 
to obtain shift-twist invariant source and sink fields. First, it enables any two
initial conditions, which are related by an arbitrary transformation, to
be represented equally well in the basis. Second, it is used to derive a shift-twist
invariant advection transformation, A, thereby eliminating the grid orientation 
artifacts described above. In summary, given a desired resolution at which to 
represent the initial conditions, our new algorithm produces source and sink 
fields, at the given resolution, which transform appropriately under arbitrary 
Euclidean transformations of the input image. In contrast, in all previous contour 
completion algorithms, the degree of failure of Euclidean invariance is highly 
dependent on the resolution of the grid, and can be quite large relative to the 
grid resolution. 
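The grid orientation artifacts mentioned above can be reproduced in a few lines.
The sketch below is not the scheme of [22]; it is a minimal illustration which
assumes a periodic grid, a unit-length displacement per step, and bilinear
weights for placing mass on the grid. Under these assumptions a point mass moving
along a coordinate axis stays concentrated, while a point mass moving at an angle
to the grid disperses. The function names are hypothetical.

```python
import math


def advect_step(grid, dx, dy):
    """Shift all mass by (dx, dy) on a periodic grid using bilinear weights."""
    n = len(grid)
    out = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            m = grid[i][j]
            if m == 0.0:
                continue
            x, y = i + dx, j + dy
            i0, j0 = math.floor(x), math.floor(y)
            fx, fy = x - i0, y - j0
            # Distribute the mass over the four surrounding grid points.
            for di, wx in ((0, 1.0 - fx), (1, fx)):
                for dj, wy in ((0, 1.0 - fy), (1, fy)):
                    out[(i0 + di) % n][(j0 + dj) % n] += m * wx * wy
    return out


def peak_after(angle_deg, steps, n=64):
    """Advect a unit point mass for `steps` steps and return the peak value."""
    grid = [[0.0] * n for _ in range(n)]
    grid[n // 2][n // 2] = 1.0
    a = math.radians(angle_deg)
    dx, dy = math.cos(a), math.sin(a)
    for _ in range(steps):
        grid = advect_step(grid, dx, dy)
    return max(max(row) for row in grid)


if __name__ == "__main__":
    for angle in (0, 15, 30, 45):
        print(angle, peak_after(angle, 20))
```

The total mass is conserved in every case; only its spatial concentration depends
on the direction of travel, which is the anisotropy at issue.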

The final step in our shift-twist invariant algorithm is to compute the com- 
pletion field (the product of the source and sink fields) in a shiftable-twistable 
basis. The particular basis used to represent completion fields is the same as the 
one used to represent the source and sink fields, except that the variance of the 
Gaussian basis functions in $\mathbb{R}^2$ needs to be halved. The need to use a slightly
different basis to represent completion fields is not biologically implausible, since 
the experimental evidence described in Section El suggests that the neural locus 
of the source and sink fields could be V1, while completion fields are more likely
located in V2. 
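The halving of the variance can be seen from the product of two Gaussians of equal
variance. As a one-dimensional sketch (the symbols $a$, $b$ and $s$ are introduced
here for illustration only and are not the notation of the paper):

```latex
\[
  e^{-\frac{(x-a)^2}{2s^2}}\; e^{-\frac{(x-b)^2}{2s^2}}
  \;=\;
  e^{-\frac{(a-b)^2}{4s^2}}\;
  \exp\!\left(-\frac{\bigl(x-\tfrac{a+b}{2}\bigr)^2}{2\,(s^2/2)}\right),
\]
since
\[
  (x-a)^2 + (x-b)^2 \;=\; 2\Bigl(x-\tfrac{a+b}{2}\Bigr)^2 + \tfrac{1}{2}(a-b)^2 .
\]
```

The product is therefore a Gaussian of variance $s^2/2$, centered midway between
the two original centers and scaled by a factor that depends only on their
separation.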

6 The Solution of the Fokker-Planck Equation 

In this section we derive a shift-twist invariant linear transformation,
$c(t+\Delta t) = (A \circ D)\,c(t)$, of the coefficient vector which evolves the
Fokker-Planck equation in a shiftable-twistable basis. The derivation holds for
any shiftable-twistable basis constructed from shiftable-twistable functions of
the form, $\Psi_\omega(x,\theta) = \psi(x)\,e^{i\omega\theta}$, for some function,
$\psi(x)$. Since the transformation, $A \circ D$, will only involve interactions
between functions, $\psi(x)$, at different positions $k\Delta$, and not at different
scales or orientations, the basis functions and coefficients will be denoted by
$\Psi_{k,\omega}$ and $c_{k,\omega}(t)$ respectively.*

To derive an expression,

$$c_{\ell,\eta}(t+\Delta t) = \sum_{k,\omega} A_{\ell,\eta;k,\omega}(\Delta t)\,c_{k,\omega}(t),$$ (6.1)

for the advection transformation, $A$, in the basis, $\Psi_{k,\omega}$, we exploit
the fact that spatial advection can be done perfectly using shiftable basis
functions, $\psi_k(x)$, in $\mathbb{R}^2$, and the continuous variable,
$\theta \in S^1$. Suppose that $P$ is given in the form,

$$P(x,\theta;t) = \sum_{k,\omega} c_{k,\omega}(t)\,e^{i\omega\theta}\,\psi_k(x) = \sum_k \hat{c}_{k,\theta}(t)\,\psi_k(x),$$ (6.2)

where $\hat{c}(t)$ is related to $c(t)$ by the standard synthesis formula for
Fourier series, $\hat{c}_{k,\theta} = \sum_\omega c_{k,\omega}\,e^{i\omega\theta}$,
which we denote by $\hat{c} = F^{-1}c$. Then the translation of $P$ in direction,
$\theta$, at unit speed, for time, $\Delta t$, is given by

$$P(x,\theta;t+\Delta t) = P(x - \Delta t[\cos\theta,\sin\theta]^T,\theta;t)$$ (6.3)

$$= \sum_k \hat{c}_{k,\theta}(t)\,\psi_k(x - \Delta t[\cos\theta,\sin\theta]^T),$$ (6.4)

where the second equation follows from equation (6.2). The shiftability of $\psi$
then implies that

$$\hat{c}_{\ell,\theta}(t+\Delta t) = \sum_k \hat{A}_{\ell,\theta;k,\theta}(\Delta t)\,\hat{c}_{k,\theta}(t),$$ (6.5)

where

$$\hat{A}_{\ell,\theta;k,\theta}(\Delta t) = b_{\ell-k}(\Delta t[\cos\theta,\sin\theta]^T).$$ (6.6)

Finally, the advection transformation, $A$, in the basis, $\Psi_{k,\omega}$, is
given by the similarity transformation, $A = F\hat{A}F^{-1}$, where $F$ denotes the
standard analysis formula for Fourier series,
$(Ff)(\omega) = \frac{1}{2\pi}\int_0^{2\pi} f(\theta)\,e^{-i\omega\theta}\,d\theta$.
Since $c = F\hat{c}$ we have the following result.

Theorem 1. In the basis, $\Psi_{k,\omega}$, the advection transformation, $A$, is
given by

$$c_{\ell,\eta}(t+\Delta t) = \sum_{k,\omega} \hat{b}_{\ell-k,\eta-\omega}(\Delta t)\,c_{k,\omega}(t),$$ (6.7)

where

$$\hat{b}_{k,\eta}(\Delta t) = \frac{1}{2\pi}\int_0^{2\pi} b_k(\Delta t[\cos\theta,\sin\theta]^T)\,e^{-i\eta\theta}\,d\theta.$$ (6.8)

In particular, the transformation, $A$, is shift-twist invariant and is a
convolution operator on the vector space of coefficients, $c_{k,\omega}$.

* Since we are using Fourier series in $\theta$, the transformation, $D$, can be
implemented in a shift-twist invariant manner by applying a standard finite
differencing scheme to the coefficients.
Fig. 3. The geometry of the straight line completion field experiment (left). Graph
(right) of the mean along a section normal to the straight line completion field as a
function of the direction, $\phi$, for our new algorithm (dashed line) and for the
algorithm of [22] (solid line)



Theorem 1 implies that the computation of source and sink fields can be
performed in a recurrent neural network using a fixed set of units as described
in [22]. Since the advection transformation, $A$, is a convolution operator on the
space of coefficients, for efficiency's sake we implemented both $A$ and $D$ in the
3D Fourier domain of the coefficient vector. In this domain, $A$ is given by a
diagonal matrix and $D$ by a circulant tridiagonal matrix.
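To illustrate why a finite differencing scheme for diffusion and decay takes a
circulant tridiagonal form, consider the following sketch. It is not the
implementation used in the paper: it assumes an explicit Euler step, a diffusion
term of the form $(\sigma^2/2)\,\partial^2/\partial\theta^2$, a decay factor of
$e^{-\Delta t/\tau}$ per step, and samples on a periodic grid of directions. The
function name is hypothetical.

```python
import math


def diffuse_decay_step(p, sigma, tau, dt):
    """One explicit finite-difference step of diffusion in theta plus decay.

    `p` holds samples on a periodic grid of len(p) directions. Each output
    entry depends only on itself and its two periodic neighbours, so the
    update is a circulant tridiagonal matrix applied to p.
    """
    n = len(p)
    dtheta = 2.0 * math.pi / n
    r = 0.5 * sigma ** 2 * dt / dtheta ** 2  # should not exceed 0.5
    decay = math.exp(-dt / tau)
    return [
        decay * (p[j] + r * (p[(j + 1) % n] - 2.0 * p[j] + p[(j - 1) % n]))
        for j in range(n)
    ]


if __name__ == "__main__":
    p = [0.0] * 36
    p[0] = 1.0
    for _ in range(10):
        p = diffuse_decay_step(p, sigma=0.08, tau=4.5, dt=0.1)
    print(sum(p), p[:3])
```

The diffusion part conserves the total mass, so in this sketch the sum of the
samples shrinks only through the decay factor.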

7 Experimental Results 

We present three experiments demonstrating the Euclidean invariance of our
algorithm. In each experiment, the Gaussian-Fourier basis consisted of K = 160
translates in each spatial variable of a Gaussian (of period X = 40.0 units),
and harmonic signals of A = 92 frequencies in the angular variable, for a total
of $2.355 \times 10^6$ basis functions. Pictures of completion fields were obtained
by analytically integrating over $\theta$ and rendering the completion field on a
256 × 256 grid.
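The basis sizes quoted in this section follow from simple products, which the
short check below reproduces. The grid of 256 × 256 positions with 36 orientations
is the one stated in the text for the finite differencing scheme; the function
name is hypothetical.

```python
def count_basis_functions(per_spatial_axis, angular):
    """Number of basis functions for a square spatial grid times an angular axis."""
    return per_spatial_axis * per_spatial_axis * angular


if __name__ == "__main__":
    # Gaussian-Fourier basis: 160 translates per spatial axis, 92 frequencies.
    print(count_basis_functions(160, 92))   # 2355200
    # Dirac basis: 256 x 256 grid with 36 discrete orientations.
    print(count_basis_functions(256, 36))   # 2359296
```

The two totals differ by less than 0.2 percent, consistent with the stated intent
of using approximately the same number of basis functions for both algorithms.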

We compare the new algorithm with the finite differencing scheme of [22].
For the method of [22], the $40.0 \times 40.0 \times 2\pi$ space was discretized
using a 256 × 256 spatial grid with 36 discrete orientations, for a total of
$2.359 \times 10^6$ Dirac basis functions. The intent was to use approximately the
same number of basis functions for both algorithms. The initial conditions were
represented on the grid using tri-linear interpolation and pictures of the
completion fields were obtained by summing over the discrete angles. The same
parameters were used for both algorithms. The decay constant was $\tau = 4.5$ and
the time increment, $\Delta t = 0.1$. The diffusion parameter was $\sigma = 0.08$
for the first and second experiments and $\sigma = 0.14$ for the third.* In
Figures 4, 5 and 6 the completion fields constructed using the algorithm of [22]
are in the top row, while those constructed using the new algorithm are in the
bottom row.

Fig. 4. Straight line completion fields due to an initial stimulus consisting of two points
on a circle with direction, $\phi$, normal to the circle, for $\phi$ = 0°, 5°, 10°, 15° (left to right).
The completion fields were computed using the algorithm of [22] (top row) and using
the new algorithm (bottom row)

In the first experiment, we computed straight line completion fields joining
two diametrically opposed points on a circle of radius, 16.0, with initial directions
normal to the circle. That is, given an angle, $\phi$, the initial stimulus consisted of
the two points, $(\pm 16.0\cos\phi, \pm 16.0\sin\phi, \phi)$, see Figure 3 (left). The
completion fields are shown in Figure 4 with those in the top row, computed using
the method of [22], clipped above at 2 x 10

To compare the degree of Euclidean invariance of the two algorithms, we
extracted a section of each completion field along the diameter of the circle
normal to the direction of the completion field. In Figure 3 (right), we plot the
mean of each section as a function of the angle $\phi$. The dashed line indicates
the means computed using the new algorithm, and the solid line shows the
means computed using the algorithm of [22].** The fact that the dashed line
graph is constant provides solid evidence for the Euclidean invariance of the
new algorithm. The solid line graph demonstrates the two major sources of the
lack of Euclidean invariance in the method of [22]. First, the rapid oscillation of
period 10° is due to the initial conditions coming in and out of phase with the
angular grid. This 10° periodicity can be seen in the periodicity of the general
shape of the completion fields in the top row of Figure 4. Second, the large
spikes at 90° intervals are due to the anisotropic manner in which the advection
transformation was solved on the spatial grid. These large spikes correspond to
the very bright horizontal line artifacts in the first two completion fields in the
top row of Figure 4.

* The diffusion parameter, $\sigma$, was required to be larger in the third experiment
because of the high curvature circles in the Kanizsa triangle figure.

** The angles, $\phi$, were taken in 5° increments from 0° to 45°. For illustration
purposes the $\phi$-axis was extended to 360° so as to reflect the symmetry of the
grid. Both graphs were normalized to have average value one.

Fig. 5. Completion fields due to the Ehrenstein initial stimulus in Figure da) (left
column) and with the initial conditions rotated clockwise by 45° (right column). The
completion fields were computed using the algorithm of [22] (top) and using the new
algorithm (bottom)

In the second experiment, we computed completion fields due to rotations
of the Ehrenstein initial stimulus in Figure da). Pictures of the completion
fields are shown in Figure 5. The left column shows the completion fields
due to the Ehrenstein stimulus in Figure da), while in the right column the
Because of the periodicity in the spatial variables, x, to avoid wrap around in this
experiment, for the new algorithm the computation was performed on an
$80.0 \times 80.0 \times 2\pi$ space with K = 320.
Fig. 6. Completion fields due to the Kanizsa triangle initial stimulus in Figure Hb)
(left column) and with the initial conditions rotated clockwise by 5° (right column).
The completion fields were computed using the algorithm of [22] (top) and using the
new algorithm (bottom)



initial conditions have been rotated clockwise by 45°. The completion fields
computed using the method of [22] were clipped above at 1.25 x 10

For our final experiment, we compute completion fields due to rotations and
translations of the Kanizsa Triangle stimulus in Figure mb). Completion fields are
shown in Figure El which was discussed in the Introduction, and in Figure 6. The
left column of Figure 6 shows completion fields due to the Kanizsa Triangle
in Figure mb). In the right column the initial conditions have been rotated
clockwise by 5°. The completion fields computed using the method of [22] were
clipped above at $9 \times 10^{-6}$.

The completion fields in the bottom rows of Figures El and El and in Figure El 
demonstrate the Euclidean invariance of our new algorithm. This is in marked 
contrast with the obvious lack of Euclidean invariance in the completion fields 
in the top rows of Figures 0 and El The visible straight line artifacts in these 
completion fields, which are oriented along the coordinate axes, are due to the 
anisotropic nature of the advection process in the algorithm of [22], and (to a
lesser extent), to the way in which the initial conditions were represented on the
grid.



8 Conclusion 

An important initial stage in the analysis of a scene requires completion of 
the boundaries of partially occluded objects. Williams and Jacobs introduced 
the notion of the stochastic completion field which measures the probability dis- 
tribution of completed boundary shapes in a given scene. In this article we have 
described a new, parallel, algorithm for computing stochastic completion fields. 
As is required of any computational model of human visual information proces- 
sing, our algorithm attempts to reconcile the apparent contradiction between 
the Euclidean invariance of human early visual computations on the one hand, 
and the observed sparseness of the discrete spatial sampling of the visual field 
by primary and secondary visual cortex on the other hand. The new algorithm 
reconciles these two contradictions by performing the computation in a basis 
of separable functions with spatial components similar to the receptive fields of 
simple cells in primary visual cortex. In particular, the Euclidean invariance of 
the computation is achieved by exploiting the shiftability and twistability of the 
basis functions. 

In this paper, we have described three basic results. First, we have generalized 
Simoncelli et al.'s notion of shiftability and steerability in $\mathbb{R}^2$ to a more general
notion of shiftability and twistability in $\mathbb{R}^2 \times S^1$. The notion of shiftability and
twistability mirrors the coupling between the advection and diffusion terms in 
the Fokker-Planck equation, and at a deeper level, basic symmetries in the un- 
derlying random process characterizing the distribution of completion shapes. 
Second, we described a new method for numerical solution of the Fokker-Planck 
equation in a shiftable-twistable basis. Finally, we used this solution to compute 
stochastic completion fields, and demonstrated, both theoretically and experi- 
mentally, the invariance of our computation under translations and rotations of 
the input pattern. 



References 

1. Blasdel, G., and Obermeyer, K., Putative Strategies of Scene Segmentation in 
Monkey Visual Cortex, Neural Networks, 7 , pp. 865-881, 1994. 

2. Cowan, J.D., Neurodynamics and Brain Mechanisms, Cognition, Computation 
and Consciousness, Ito, M., Miyashita, Y. and Rolls, E., (Eds.), Oxford UP, 1997. 

3. Daugman, J., Uncertainty Relation for Resolution in Space, Spatial Frequency, 
and Orientation Optimized by Two-dimensional Visual Cortical Filter, J. Opt. 
Soc. Am. A, 2 , pp.1160-1169, 1985. 

4. Daugman, J., Complete Discrete 2-D Gabor Transforms by Neural Networks for 
Image Analysis and Compression, IEEE Trans. Acoustics, Speech, and Signal 
Processing 36(7), pp. 1,169-1,179, 1988. 






5. Eyesel, U. Turning a Corner in Vision Research, Nature, 399 , pp. 641-644, 1999. 

6. Freeman, W., and Adelson, E., The Design and Use of Steerable Filters, IEEE 
Trans. PAMI, 13 (9), pp. 891-906, 1991. 

7. Gilbert, C.D., Adult Cortical Dynamics, Physiological Review, 78 , pp. 467-485, 
1998. 

8. Grossberg, S., and Mingolla, E., Neural Dynamics of Form Perception: Boundary
Completion, Illusory Figures, and Neon Color Spreading, Psychological Review, 
92, pp. 173-211, 1985. 

9. Heitger, R. and von der Heydt, R., A Computational Model of Neural Contour 
Processing, Figure-ground and Illusory Contours, Proc. of 4th Intl. Conf. on
Computer Vision, Berlin, Germany, 1993.

10. von der Heydt, R., Peterhans, E. and Baumgartner, G., Illusory Contours and 
Cortical Neuron Responses, Science, 224 , pp. 1260-1262, 1984. 

11. Iverson, L., Toward Discrete Geometric Models for Early Vision, Ph.D. disserta- 
tion, McGill University, 1993. 

12. Kalitzin, S., ter Haar Romeny, B., and Viergever, M., Invertible Orientation Bund- 
les on 2D Scalar Images, in Scale-Space Theory in Computer Vision, ter Haar 
Romeny, B., Florack, L., Koenderink, J. and Viergever, M., (Eds.), Lecture Notes 
in Computer Science, 1252 , 1997, pp. 77-88.

13. Li, Z., A Neural Model of Contour Integration in Primary Visual Cortex, Neural 
Computation, 10(4), pp. 903-940, 1998. 

14. Marcelja, S. Mathematical Description of the Responses of Simple Cortical Cells, 
J. Opt. Soc. Am., 70 , pp. 1297-1300, 1980.

15. Mumford, D., Elastica and Computer Vision, Algebraic Geometry and Its
Applications, Chandrajit Bajaj (ed.), Springer-Verlag, New York, 1994.

16. Parent, P., and Zucker, S.W., Trace Inference, Curvature Consistency and Curve 
Detection, IEEE Transactions on Pattern Analysis and Machine Intelligence, 11 ,
pp. 823-889, 1989. 

17. Shashua, A. and Ullman, S., Structural Saliency: The Detection of Globally
Salient Structures Using a Locally Connected Network, 2nd Intl. Conf. on Computer
Vision, Clearwater, FL, pp. 321-327, 1988.

18. Simoncelli, E., Freeman, W., Adelson E. and Heeger, D., Shiftable Multiscale 
Transforms, IEEE Trans. Information Theory, 38(2), pp. 587-607, 1992. 

19. Thornber, K.K. and Williams, L.R., Analytic Solution of Stochastic Completion 
Fields, Biological Cybernetics 75 , pp. 141-151, 1996. 

20. Thornber, K.K. and Williams, L.R., Orientation, Scale and Discontinuity as 
Emergent Properties of Illusory Contour Shape, Neural Information Processing 
Systems 11 , Denver, CO, 1998. 

21. Williams, L.R., and Jacobs, D.W., Stochastic Completion Fields: A Neural Model 
of Illusory Contour Shape and Salience, Neural Computation, 9(4), pp. 837-858, 
1997, (also appeared in Proc. of the 5th Intl. Conference on Computer Vision
(ICCV ’95), Cambridge, MA). 

22. Williams, L.R., and Jacobs, D.W., Local Parallel Computation of Stochastic Com- 
pletion Fields, Neural Computation, 9(4), pp. 859-881, 1997. 

23. Williams, L.R. and Thornber, K.K., A Comparison of Measures for Detecting 
Natural Shapes in Cluttered Backgrounds, Intl. Journal of Computer Vision, 34
(2/3), pp. 81-96, 1999. 

24. Wandell, B.A., Foundations of Vision, Sinauer Press, 1995. 

25. Yen, S. and Finkel, L., Salient Contour Extraction by Temporal Binding in a 
Cortically-Based Network, Neural Information Processing Systems 9, Denver, CO, 
1996. 




Bootstrap Initialization of Nonparametric 
Texture Models for Tracking 



Kentaro Toyama¹ and Ying Wu²

¹ Microsoft Research, Redmond, WA 98052, USA
kentoy@microsoft.com

² University of Illinois (UIUC), Urbana, IL 61801, USA
yingwu@ifp.uiuc.edu



Abstract. In bootstrap initialization for tracking, we exploit a weak
prior model used to track a target to learn a stronger model, without
manual intervention. We define a general formulation of this problem
and present a simple taxonomy of such tasks.

The formulation is instantiated with algorithms for bootstrap in- 
itialization in two domains: In one, the goal is tracking the position of a 
face at a desktop; we learn color models of faces, using weak knowledge 
about the shape and movement of faces in video. In the other task, we 
seek coarse estimates of head orientation; we learn a person-specific el- 
lipsoidal texture model for heads, given a generic model. For both tasks, 
we use nonparametric models of surface texture. 

Experimental results verify that bootstrap initialization is feasi- 
ble in both domains. We find that (1) independence assumptions in the 
learning process can be violated to a significant degree, if enough data is 
taken; (2) there are both domain-independent and domain-specific me- 
ans to mitigate learning bias; and (3) repeated bootstrapping does not 
necessarily result in increasingly better models. 



1 Introduction 

Often, we know something about the target of a tracking task in advance, but 
specific details about the target will be unknown. For example, in desktop inter- 
faces, we are likely to be interested in the moving ellipsoid that appears in the 
image, but we may not know the user’s skin color, 3D shape, or the particular 
geometry of the facial features. If we could learn this additional information du- 
ring tracking, we could use it to track the same objects more accurately, more 
efficiently, or more robustly. 

This problem, which we call bootstrap initialization for tracking, arises
whenever the target object is not completely known a priori. In Section 2 we
propose an abstract formulation of bootstrap initialization. Section 3 offers a taxonomy
of bootstrap initialization problems and reviews previous work. Sections 0 and El 
discuss experiences with two domains in which the learned models are nonpara- 
metric models of target texture. We take the Bayesian perspective that models 
represent a tracking system’s “belief” about the target. Experiments show how 
different data sampling techniques affect learning rate and quality (Section El- 

D. Vernon (Ed.): ECCV 2000, LNCS 1843, pp. 1 13- IT33I 2000. 
2 Bootstrap Initialization Formulation

Given a prior belief about the target (inherited from the system designer), the 
goal of bootstrap initialization is to learn a more useful model of the target, using 
only information gained during actual tracking. We now introduce an abstract 
framework for this concept. Reference to Figure 1 will clarify the notation.




Fig. 1. Abstract formulation of bootstrap initialization. 



The goal of tracking is the recovery of the configuration of the target,
$\mathbf{x}_m \in \mathcal{X}$, at time $t_m$, given a model of the target and a
sequence of images, $\mathcal{I}_m = \{I_1 \cdots I_m\}$ (where $I_i$ is the image
taken at time $t_i$).

Automatic initialization is impossible without some starting belief about the
target. So, we assume that there is an initial model, $\pi$, that can be used to
track targets with some reliability. This means that there is some prespecified,
initial tracking function, $f^0(\mathcal{I}_m, \pi)$, which, given a sequence of
images and an initial model, returns a distribution, $p^0_m(\mathbf{x})$, for the
probable configurations that the target assumes at time $t_m$. (We will assume
that tracking functions return distribution functions and that should it be
necessary, some way to choose a single estimate $\mathbf{x}$, given
$p^0_m(\mathbf{x})$, is also specified.) The initial model represents a belief about
the target that the system designer passes on to the algorithm, e.g., the target is
blue, it is round, it matches template A, etc. We leave the form of the initial
model intentionally vague; what matters is the existence of an $f^0$ that
incorporates the information contained in $\pi$. The initial model need not be
particularly reliable, as long as it correlates to some degree with characteristics
of the target that distinguish it from non-target objects.

For bootstrap initialization, we also posit a data acquisition function,
$g(\mathcal{I}_m)$, which takes an image sequence and returns observations,
$Z_m(\mathbf{x})$, defined over the state space. Note that $Z$ maps target states to
observations. The observations are of the type that will be relevant to the final
model, the model to be acquired.

The final model is represented by $\theta$. $\theta$ will contain all of the
information relevant to the final tracking function. Knowing $\theta$ allows the
final tracking function, $f^1(\mathcal{I}_m, \theta)$, to output $p^1_m(\mathbf{x})$.
We assume that the final model itself has some prior, denoted $\phi$, which
contains the same type of information contained in $\theta$. That is,
$f^1(\mathcal{I}_m, \phi)$ (with $\phi$, not $\theta$) makes sense, although it may
not provide good tracking output. In general, we will be concerned with
determining a final model for time $t > t_n$, for $n \geq 1$.

Next, let a pair of corresponding observations and tracking output be denoted
$D_i = (p^0_i(\mathbf{x}), Z_i(\mathbf{x}))$. By expressing its degree of belief in
intervals of $\mathcal{X}$, $p^0_i(\mathbf{x})$ effectively provides supervision that
points out which values in the range of $Z_i(\mathbf{x})$ are more or less reliable
as training data. Thus, $D_i$ represents a single instance of supervised data
annotated by the initial tracking function.

Let $\mathcal{D}_n = (D_1, \ldots, D_n)$. These are fed to a learning function,
$h(\mathcal{D}_n, \phi)$, which takes the available data and learns the final model.
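The roles of the initial tracking function, the acquisition function and the
learning function can be summarized in a short sketch. This is only a schematic
reading of the framework: the argument order, the representation of beliefs, and
the function name `bootstrap` are assumptions made for illustration, and the three
functions are supplied by the caller.

```python
def bootstrap(images, f0, g, h, pi, phi):
    """Run the initial tracker over an image sequence and learn a final model.

    f0(images_so_far, pi) -> belief over target states (initial tracking function)
    g(images_so_far)      -> observation function Z mapping states to observations
    h(data, phi)          -> final model theta, learned from data and the prior phi
    """
    data = []
    for m in range(1, len(images) + 1):
        seq = images[:m]
        p0 = f0(seq, pi)       # tracking output at time t_m
        z = g(seq)             # observations at time t_m
        data.append((p0, z))   # one instance of supervised data, D_m
    return h(data, phi)


if __name__ == "__main__":
    # Toy one-dimensional example: the "tracker" picks the brightest pixel.
    images = [[5, 9, 5], [5, 5, 9]]

    def f0(seq, pi):
        frame = seq[-1]
        return {max(range(len(frame)), key=lambda i: frame[i]): 1.0}

    def g(seq):
        frame = seq[-1]
        return lambda x: frame[x]

    def h(data, phi):
        total = sum(p * z(x) for belief, z in data for x, p in belief.items())
        return (phi + total) / (1 + len(data))

    print(bootstrap(images, f0, g, h, pi=None, phi=0.0))
```

In the toy usage the learned "model" is simply a belief-weighted mean of the
observed values, with the prior counted as one pseudo-observation.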

This framework has been structured broadly enough to encompass existing
forms of bootstrap initialization. As an example, consider a recursive estimation
scheme such as the Kalman filter. In our framework, the Kalman filter
corresponds to the following: $n = 1$, $f^1 = f^0$, the models $\pi$, $\phi$, and
$\theta$ contain state and covariance parameters, $\phi = \pi$, and $h$ updates
$\phi$ to $\theta$. Because $f^1 = f^0$, and $\pi$ and $\theta$ share similar
structure, the Kalman filter can (and does) iterate bootstrap initialization by
setting $\pi_i = \theta_{i-1}$.

The preference for the function $f^1$ over $f^0$ supplies the entire raison d'être
for bootstrap initialization; we expect at least one of the following to be true:

- $f^1$ is more accurate than $f^0$, e.g., for ground truth target configuration
  $\mathbf{x}^*$,
  $\|\arg\max_{\mathbf{x}} p^1(\mathbf{x}) - \mathbf{x}^*\| < \|\arg\max_{\mathbf{x}} p^0(\mathbf{x}) - \mathbf{x}^*\|$,

- $f^1$ can be computed more efficiently than $f^0$,

- $f^1$ is more robust than $f^0$, or

- $f^1$ is otherwise preferable to $f^0$.

We anticipate that in most applications, the forms of $\mathbf{x}$, $\pi$, $\phi$,
$\theta$, $f^0$, and $f^1$ will be well understood. Thus, the interesting problems in
bootstrap initialization are in the design of the acquisition function, $g$, and
the learning function, $h$.

3 Taxonomy and Related Work 

To help restrict our attention to a small subset of bootstrap problems we will 
consider a taxonomy for the common types of initialization problems in tracking. 
We propose a classification of initialization problems based on the following axes: 

- Does the final model learn information about the object geometry or the
  surface texture of the target?

- Does the final model involve information about static or dynamic
  properties of the target?

- Is the final model parametric or nonparametric?

- Does the learning take place adaptively or in batch mode?

- Is the information contained in the initial and final models same or
  different?

Very little prior work addresses automatic initialization for the sake of trac- 
king, but there is a body of related work that fits the bootstrap initialization 
paradigm. 

Classic structure from motion algorithms can be thought to learn the static 
(rigid) geometry of an object, often outside of any explicit parametrization. 
Most such work is cast as a batch problem, and rarely as an initialization step for 
tracking, but there are exceptions. In some facial pose tracking work, for example, 
3D points on the face are adaptively estimated (learned) using Extended Kalman 
Filters jl Hj . Care must be used to structure the EKF correctly 0, but doing so 
ensures that as the geometry is better learned, tracking improves, as well. 

Other research focuses on learning textural qualities. Again, in the domain 
of facial imagery, there is work in which skin color is modeled as a parametrized 
mixture of n Gaussians in some color space EEa. Work here has covered both 
batch mi and adaptive learning with much success. The preferred learning 
algorithm for parameter learning in these instances is expectation-maximization. 

Although color distributions are a gross quality of object texture, learning of 
localized texture is also of interest. Work here focuses on intricate facial geometry 
and texture, using an array of algorithms to recover fine detail [Z]. 

Finally, there is research in learning of dynamic geometry - the changing con- 
figuration (pose or articulation) of a target. The most elementary type occurs 
with the many variations of the Kalman filter, which “learns” a target’s geome- 
tric state |3- In these cases, the value of the learned model is fleeting, since few 
targets ever maintain fixed dynamics. More interesting learning focuses on mo- 
dels of motion. Existing research includes learning of multi-state motion models 
of targets which exhibit a few discrete patterns of motion EISI. 

Our work focuses on bootstrap initialization of nonparametric models for 
the static texture of faces. In contrast with previous work, we explicitly consider 
issues of automatic learning during tracking without manual intervention. 



4 Nonparametric Texture Models 

In our first example, we consider learning a skin-color distribution model of a 
subject’s face, using a contour tracking algorithm to offer samples of target skin 
color. We use this model for color-based tracking of facial position. 

In the second, we learn a person-specific 3D texture model of a subject’s 
head, using a generic model to provide supervisory input. The 3D model is used 
to estimate approximate head orientation, given head location in an image. 
The models we use will be nonparametric in the sense adopted by common
statistical parlance. For example, the skin-color distributions are modeled by hi- 
stograms. Strictly speaking, histograms have a finite number of parameters equal 
to the number of bins, but they can also be considered discrete approximations 
to elements of a nonparametric function space. Likewise, the 3D texture models 
we use are discrete approximations to a continuous surface model of texture. 



4.1 Color PDF 

Tracking of faces using color information is popular both for its speed and
simplicity. Previous techniques require manual initialization or parameter tuning in
order to achieve optimal performance. At best, a manually initialized model
adapts over time. Below, we consider automatic initialization of a color
model that bootstraps from prior knowledge about object shape and movement.



Framework Instantiation We will assume that the goal is estimation of x = 
(x,y,s), the position and scale of an approximately upright face at a desktop 
computer. 

The initial model, $\pi$, encapsulates a priori knowledge of users' heads at a
PC. In particular, they are likely to project edges shaped like an ellipse at a
limited range of scales, and they are likely to exhibit motion from time to time.
Given this knowledge and an incoming image sequence, $\mathcal{I}_m$, the initial
tracking function, $f^0$, performs adjacent frame differencing (with decay) on
frames $I_m$ and $I_{m-1}$ to detect moving edges and follows this with simple
contour tracking to track the most salient ellipse.

The acquisition function, $g(\mathcal{I}_m)$, returns the following observation
function: $Z_m(\mathbf{x}) = I_m(x, y)$. That is, the mapping from state to
observation simply returns the RGB pixel value of the pixel at the center of the
tracked ellipse in the current image (other schemes such as averaging among a set
of pixels are also possible and may reduce noise).

Finally, we consider the form of the prior, $\phi$, and posterior, $\theta$, of the final
model. Both are represented by normalized histograms, which serve as
approximations to the pdf of skin color. The histogram itself will be represented by a
Dirichlet distribution. The reasons for this choice will be explained in the next
section. Observed pixel values will be represented by the random variable
$\mathbf{U} \in \mathbb{R}^3$.

Given a likelihood function for skin color, it is a simple matter to define a
final tracking function that tracks a single face in an image by computing spatial
moments.



Bootstrap Initialization Algorithm We now describe our bootstrap
initialization algorithm for learning color pdfs of skin, assuming we have a body of
data, $\mathcal{D}$, from some frames of tracking.

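First, we can cast the goal of bootstrap initialization in this case to be
$p(\mathbf{U} \mid \mathcal{D}_n, \phi)$ (recall that $\mathcal{D}_n$ is the body of supervised data
acquired between time $t_1$ and $t_n$). We can determine $p(\mathbf{U} \mid \mathcal{D}_n, \phi)$ if we
know the final model, $p(\theta \mid \mathcal{D}_n, \phi)$. The latter can be computed by Bayes' Rule:

$$p(\theta \mid \mathcal{D}, \phi) = \frac{p(\theta \mid \phi)\, p(\mathcal{D} \mid \theta, \phi)}{p(\mathcal{D} \mid \phi)}, \qquad (1)$$

where the marginal likelihood, $p(\mathcal{D} \mid \phi)$, is given by

$$p(\mathcal{D} \mid \phi) = \int p(\mathcal{D} \mid \theta, \phi)\, p(\theta \mid \phi)\, d\theta. \qquad (2)$$

We can then compute $p(\mathbf{U} \mid \mathcal{D}, \phi)$ by marginalizing over $\theta$,

$$p(\mathbf{U} \mid \mathcal{D}, \phi) = \int p(\mathbf{U} \mid \theta, \phi)\, p(\theta \mid \mathcal{D}, \phi)\, d\theta. \qquad (3)$$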

In general, neither the posterior probability in Equation 1 nor the integral in
Equation 3 is easy to compute, since expressions for $p(\mathcal{D} \mid \theta, \phi)$ and
$p(\theta \mid \phi)$ can be arbitrarily complex. Fortunately, there are approximations that
simplify the analysis. We discretize $\mathbf{U}$ and assume that our distributions can be
captured by conjugate distributions, which provide tractable, analytical solutions
under certain assumptions about the models.

First, we discretize the observed variable, $\mathbf{U}$, such that it can assume any of $r$
possible values, $\mathbf{u}^1, \ldots, \mathbf{u}^r$. Assume that the final model parameters are given by
$\theta = \{\theta_1, \ldots, \theta_r\}$, with $\theta_k > 0$ and $\sum_{k=1}^{r} \theta_k = 1$, and that the
likelihood function for $\mathbf{U}$ is given by

$$p(\mathbf{U} = \mathbf{u}^k \mid \theta, \phi) = \theta_k, \qquad (4)$$

for $k = 1, \ldots, r$. Clearly, we can represent any pdf to arbitrary precision by
varying $r$. In our case, we use $32^3$ bins, where each of the RGB color channels is
quantized into 32 discrete values.

If the data, $\mathcal{D}_n$, can be reduced to $n$ independent observations of $\mathbf{U}$, the
process of observation is a multinomial sampling, where a sufficient statistic
is the number of occurrences of each $\theta_k$ in $\mathcal{D}_n$. As mentioned earlier, we force
the algorithm to choose one observation per frame as follows: For each $\mathcal{D}_i$, we
choose the pixel at $Z_i(\mathbf{x}')$, where $\mathbf{x}' = \arg\max_{\mathbf{x}} p^0(\mathbf{x})$. Then, if we let $N_k$ be
equal to the total number of occurrences of $\theta_k$ in the data ($N = \sum_{k=1}^{r} N_k$), then

$$p(\mathcal{D}_n \mid \theta, \phi) = \prod_{k=1}^{r} \theta_k^{N_k}. \qquad (5)$$

What remains now is to determine the form of the prior, $p(\theta \mid \phi)$. We choose
Dirichlet distributions, which when used as a prior for this example, have several
nice properties. Among them are the fact that (1) a Dirichlet prior ensures a
Dirichlet posterior distribution, and (2) there is a simple form for estimating
$p(\mathbf{U} \mid \mathcal{D}, \phi)$, which is our eventual goal. The Dirichlet distribution is as follows:

$$p(\theta \mid \phi) = \mathrm{Dir}(\theta \mid \alpha_1, \ldots, \alpha_r) \qquad (6)$$

$$\equiv \frac{\Gamma(\alpha)}{\prod_{k=1}^{r} \Gamma(\alpha_k)} \prod_{k=1}^{r} \theta_k^{\alpha_k - 1}, \qquad (7)$$

where $\alpha_k$ is a “hyperparameter” for the prior, with $\alpha_k > 0$, $\alpha = \sum_{k=1}^{r} \alpha_k$, and
$\Gamma(\cdot)$ is the Gamma function.

Properly, a Dirichlet distribution is a unimodal distribution on an
$(r - 1)$-dimensional simplex. When used to represent a distribution of a single
variable with $r$ bins, it can be interpreted as a distribution of distributions. In our
case, we use it to model the distribution of possible distributions of $\mathbf{U}$, where
$p(\mathbf{U} = \mathbf{u}^k \mid \mathcal{D}, \phi)$ is the expected probability of $\mathbf{u}^k$ integrated over $\theta$
(Equation 3). Examples of Dirichlet distributions for $r = 2$ (also known as Beta
distributions) are given in Figure 2.








Fig. 2. Examples of (a) 2-parameter Dirichlet functions (Beta functions) and (b,c) their 
corresponding 2-bin histograms. A 10-parameter Dirichlet function could represent the 
histogram in (d). 



As distributions of distributions, Dirichlet distributions contain more
information than a single pdf alone. For example, while the pdf shown in Figure 2(b)
is the expected pdf for any Beta distribution with $\alpha_1 = \alpha_2$, the Beta
distribution also gives us information about our confidence in that pdf. Specifically,
as $\alpha = \alpha_1 + \alpha_2$ increases, our confidence in the expected pdf increases as well.
This is illustrated by the increased peakedness corresponding to increasing $\alpha$ in
Figure 2(a).

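With this prior, the posterior becomes

$$p(\theta \mid \mathcal{D}, \phi) = \mathrm{Dir}(\theta \mid \alpha_1 + N_1, \ldots, \alpha_r + N_r), \qquad (8)$$

and the probability distribution for $\mathbf{U}_{n+1}$ is

$$p(\mathbf{U}_{n+1} = \mathbf{u}^k \mid \mathcal{D}, \phi) = \int \theta_k\, p(\theta \mid \mathcal{D}, \phi)\, d\theta = \frac{\alpha_k + N_k}{\alpha + N}. \qquad (9)$$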

The surprising consequence of the discretization of $\mathbf{U}$ and the assumption of the
Dirichlet prior is the simple form of Equation 9. Effectively, we need only count
the number of samples in the data for each bin of the histogram. Also, note how
the expression appeals to our intuition: First, if $\alpha_k = 1$ for all $k$ (a flat,
low-information prior, which we use in our implementation), then the probability of



126 K. Toyama and Y. Wu 



observing $\mathbf{u}^k$ is $(N_k + 1)/(N + r)$, which asymptotically approaches the fraction
that is observed in the data. Second, as the number of observations increases,
the effect of the prior diminishes; in the limit, the influence of the prior vanishes.
Lastly, we find a particularly intuitive form for expressing our prior beliefs. Our
relative sense for how often each of the $\mathbf{u}^k$ occurs is decided by the relative values
of $\alpha_k$, and the confidence with which we believe in our prior is determined by
their sum, $\alpha$.
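
As an illustration of Equation 9, the following is a minimal sketch of the counting estimate, assuming 8-bit RGB pixels and the flat prior $\alpha_k = 1$ described above. The function names (`bin_index`, `posterior_predictive`) and the sample pixels are hypothetical and are not taken from the original implementation.

```python
from collections import Counter

BINS_PER_CHANNEL = 32          # 32 levels per RGB channel, as in the text
R = BINS_PER_CHANNEL ** 3      # r = 32^3 histogram bins


def bin_index(rgb):
    """Map an 8-bit (R, G, B) triple to one of the 32^3 bins."""
    r, g, b = (c * BINS_PER_CHANNEL // 256 for c in rgb)
    return (r * BINS_PER_CHANNEL + g) * BINS_PER_CHANNEL + b


def posterior_predictive(pixels, alpha_k=1.0):
    """Return a function k -> (alpha_k + N_k) / (alpha + N), as in Equation 9."""
    counts = Counter(bin_index(p) for p in pixels)   # N_k for the observed bins
    n = sum(counts.values())                         # N, the number of samples
    alpha = alpha_k * R                              # alpha = sum of all alpha_k
    return lambda k: (alpha_k + counts.get(k, 0)) / (alpha + n)


# Hypothetical data: three tracked pixels, two of which fall in the same bin.
pixels = [(200, 150, 120), (201, 151, 121), (90, 60, 40)]
p_skin = posterior_predictive(pixels)
```

With only three samples the estimate is still dominated by the prior; as more frames are tracked, the counts $N_k$ take over, which is the behavior described in the paragraph above.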



4.2 3D Feature-Mapped Surface 

In our second example, we consider the task of estimating a person’s approximate 
head pose, given head location in an image. We distinguish “head pose” from 
“facial pose” by the range of applicability: facial pose is restricted to images 
where most of the face is visible. 

In contrast to pose-tracking techniques that give precise pose estimates for
close-up, well-lit facial images of known subjects, we consider
coarse, but robust, estimation of pose for unknown subjects under a variety of
circumstances. By using a generic model to provide initial pose estimates, we
can learn a new model that is tailored to that person.



Framework Instantiation The output is $\mathbf{x} = (r_x, r_y, r_z)$, the rotational pose
of a person’s head. We will assume that other parameters (position and scale,
for example) have been recovered by other means.

In this case, the initial model, $\pi$, the final model, $\theta$, and the prior for the
final model, $\phi$, all take the same form: We model heads as ellipsoids with a set of
points on the surface. Each point, indexed by $j$ ($1 \le j \le m$), is represented by its
coordinates (lying on the ellipsoid surface) and a pdf representing the belief
probability, $p_j(\mathbf{z} \mid \mathbf{x})$ - the belief that given a particular pose, the point $j$ will
project observation $\mathbf{z}$. Model points are placed at the intersections of
regularly-spaced latitudinal and longitudinal lines, where “north pole” coincides with the
front of the head (see Figure 3(a)).




Fig. 3. (a) Model point distribution; (b) rotation-invariant sum of Gabor wavelets 
for determining local edge density; (c) coefficients for a learned model viewed exactly 
frontally, for one kernel. 



The domain of the pdfs stored at model points forms a feature vector space. An
element of this space, $\mathbf{z}$, is a 5-element feature vector consisting of the transform
coefficients when five convolution kernels are applied to a pixel in the image.
For a model point $j$, $\mathbf{z}_j$ is the feature vector for the pixel on which point $j$
would project via scaled orthographic projection (assuming fixed orientation $\mathbf{x}$).
The kernels extract information about local “edge density,” which tends to be
consistent for corresponding points of people’s heads across different illumination
conditions.

The acquisition function, $g$, returns the observation function $Z$, where $Z(\mathbf{x})$
is the concatenation of the feature vectors, $\{\mathbf{z}_j\}$, observed at points in the image
which correspond to the orthographically projected points, $\{j\}$, of the model
when oriented with pose $\mathbf{x}$ (for $1 \le j \le m$). For model points $j$ that occur in
the hemisphere not facing the image plane, $\mathbf{z}_j$ is undefined.

Because the underlying models are the same, the tracking functions, $f^0$ and
$f^1$, are identical. In particular, they simply compute the maximum likelihood
pose. Given a cropped image of a head, the image is first rescaled to a canonical
size and histogram-equalized. The resulting image is convolved with the five
templates described above. Finally, we compute

$$\mathbf{x}^* = \arg\max_{\mathbf{x}} p(\mathbf{x} \mid Z) = \arg\max_{\mathbf{x}} p(Z \mid \mathbf{x}), \qquad (10)$$

using Bayes’ Rule, where we ignore the normalization constant and assume a
constant, low-information prior over possible head poses. More detail on the
pose estimation algorithm is presented elsewhere.



Bootstrap Initialization Algorithm Given a set of pose-observation pairs,
$\mathcal{D}$, where the pose pdfs are generated using a generic head model, bootstrapping
a person-specific model proceeds as follows.

Let $S_j = \{\mathbf{z}_j : \mathbf{z}_j$ is the $j$-th element of $Z_i(\arg\max_{\mathbf{x}} p^0(\mathbf{x})), \forall i\}$. That is, $S_j$
represents the set of all observations that would project to model point $j$, if, for
each pose-observation pair, the pose estimated to have the maximum likelihood
is used.

Once all of the data is collected for each model point, $j$, we estimate the
pdf for that point. In our implementation, we approximate the pdf with a single
Gaussian whose mean and covariance coincide with those of $S_j$. This is
consistent with a Bayesian approximation of the model pdfs with a low-information
prior, $\phi$, which contains Gaussian pdfs with zero mean and very large, constant
covariances at each model point. The data at each model point is thus assumed
to be independent of data at other points - this is not the case, but experiments
suggest independence serves as a reasonable approximation.
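
The per-point estimation step can be sketched as follows. This is a minimal illustration rather than the original implementation: the names `fit_gaussian` and `bootstrap_model` are hypothetical, feature vectors are plain Python lists, and the unbiased ($n - 1$) covariance normalization is an assumption, since the text does not say which normalization was used.

```python
def fit_gaussian(samples):
    """Sample mean and covariance (nested lists) of equal-length feature vectors."""
    n, d = len(samples), len(samples[0])
    mean = [sum(s[a] for s in samples) / n for a in range(d)]
    cov = [[sum((s[a] - mean[a]) * (s[b] - mean[b]) for s in samples) / (n - 1)
            for b in range(d)] for a in range(d)]
    return mean, cov


def bootstrap_model(observations_by_point):
    """Fit one Gaussian per model point j from its observation set S_j.

    Points are treated independently, mirroring the independence
    approximation in the text. Points with fewer than two observations
    are skipped because their covariance cannot be estimated.
    """
    return {j: fit_gaussian(s_j)
            for j, s_j in observations_by_point.items() if len(s_j) >= 2}


# Hypothetical observation sets, using 2-element features for brevity
# (the text uses 5-element feature vectors).
S = {0: [[1.0, 2.0], [3.0, 6.0]], 1: [[0.5, 0.5]]}
model = bootstrap_model(S)
```

Here model point 1 is omitted from the result because it has a single observation; how the original system treated such sparsely observed points is not stated in the text.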



5 Results and Analysis 

Both learning algorithms were implemented as described above. Initial results 
indicate that the bootstrapping algorithms work as expected - in both cases, the 
final model is learned without manual intervention, when only an initial model 
was available. 
Fig. 4. Average estimation errors.



For the skin-color initialization task, Figure 7 shows an example input image
(a) and the corresponding skin-color map (b) using the final model learned over
60 frames during 2 seconds of tracking.

For the head-pose task, Figure 4 displays average error rates over 4 different
angular ranges. The values indicate errors averaged over runs on 10 separate 
sequences of people recorded by different cameras, under different illumination 
conditions, and at varying distances from the camera. “Ground truth” pose was 
determined by hand because many of the data sequences were from prerecorded 
video. For testing purposes, all errors of the algorithm are measured with respect 
to the annotation. 

Because texture is more stable on the face than in hair, results were far more 
accurate when all or part of the face was actually visible. Thus, we report errors 
averaged over four regions of the pose space. The columns in Figure 4 show the
range for which errors were averaged. These numbers indicate the difference in
rotation about the y-axis between the annotated face normal and the camera’s
optical axis. A typical result for a single individual is shown in Figure 5(a).

The results suggest that no unreasonable approximations have been made
- bootstrapping works as expected. Nevertheless, because our algorithms are
based on independence assumptions which do not necessarily hold, we examine the
effect that algorithmic choices have on the final outcome.



5.1 Data Dependencies 

Both of the learning algorithms presented are based on the assumption that data 
is acquired independently and without bias from the distribution the models try 
to capture. How likely and how important is it that these assumptions hold? 

In the case of generating learning examples from tracking, the acquired data is
unlikely to represent independent samples for several reasons. First, the image
sequences involved in tracking generally exhibit little change from frame to frame.
We thus anticipate that data from adjacent frames will exhibit considerable
temporal dependencies. Second, initial tracking functions are unlikely to track the
target with high accuracy or precision (hence the need for bootstrap
initialization at all). Thus, a certain amount of noise is expected to corrupt the training
data. What is worse is if the initial tracking function (or the initial model on
which it is based) presents a consistent bias in its estimates, which propagates to
the data. The acquisition function may also introduce biases in a similar manner.

Fig. 5. (a) Differences in estimation errors for generic model (top line) and
bootstrap-iterated models (middle lines) for a typical subject. For comparison, the
results for a person-specific model trained using manually annotated data are given
as well (bottom line). (b) 1-D schematic representation of models. See Section 5.



Reducing Effects of Temporal Coherence: Dependencies due to temporal
coherence can be reduced in one of two ways. An intuitive approach is to sample
data at random instances in time that are a sufficient interval apart. A
“sufficient interval” would be on the order of time such that tracked states appear
conditionally independent. For example, in learning the skin-color model,
instead of taking samples at 30Hz, we could sample at intervals determined by
a Poisson process with intensity adjusted to sample every 0.3 seconds or so.
In Figure 6(a), we plot the entropy, $H(X) = -\sum_{x} p(x) \log p(x)$, of the final
model for skin color against the total number of data samples, where the lines
represent variation in sampling frequency. We expect the entropy to converge as
more samples are incorporated. We note that taking samples at a lower frequency
increases the learning rate per datum, suggesting that temporal coherence can
be broken through subsampling.
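
The subsampling scheme and the entropy measure can be sketched as below. This is an illustrative sketch only: the helper names, the 60-second duration, and the mapping of sample times onto 30 Hz frame indices are assumptions made for the example.

```python
import math
import random


def poisson_sample_times(duration, mean_interval, rng):
    """Sample times of a Poisson process: exponential gaps with the given mean."""
    times = []
    t = rng.expovariate(1.0 / mean_interval)
    while t < duration:
        times.append(t)
        t += rng.expovariate(1.0 / mean_interval)
    return times


def entropy(probabilities):
    """H = -sum p log p over bins with nonzero probability (natural log)."""
    return -sum(p * math.log(p) for p in probabilities if p > 0)


rng = random.Random(0)                           # fixed seed, for repeatability
times = poisson_sample_times(duration=60.0, mean_interval=0.3, rng=rng)
frames = sorted({int(t * 30) for t in times})    # frame index at 30 Hz
```

Only the frames listed in `frames` would contribute an observation to the histogram; `entropy` can then be applied to the normalized histogram after each update to produce a curve of the kind plotted in Figure 6.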

Alternatively, data can be taken over a long enough period of time that
a representative sequence of tracking contexts (i.e., spanning a wide range of
target configurations and environmental contexts) is observed. Although the
data may not be locally independent, the sufficient statistics of the data set
should approximate those of a large, randomly sampled set. This behavior is
evident in all of the plots in Figure 6, where the final models appear to converge
to similar models, regardless of sampling frequency. The inversion in learning
rates between Figure 6(a) and (b) suggests that one can trade off amount of
data to process with time required to collect data.
Fig. 6. Entropy of final model plotted against number/time of data samples. In (a),
the x-axis represents number of data samples; in (b), the time required to collect the 
samples. 



Weighting Data: Some of the problems with data acquisition may be alleviated 
if the initial tracking function returns a confidence value for its output. If we read 
these confidence values as indicators of the effective sample size that a particular 
datum represents, we can weight its contribution to the final model. 

For both examples, we can simply multiply each piece of data by a value 
proportional to its confidence. It seems strange to say that a single datum can 
represent more than one sample, so for both skin-color and head texture models, 
we normalize all weights such that they fall in the interval [0.0, 1.0]. For the 
skin-color model, we use the residual from ellipse tracking to weight each set 
of observations (better fits correspond to greater confidence). In the case of the
head-texture model, we weight by the normalized likelihood of the maximum 
likelihood pose, which is interpretable as an actual probability. 
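
For the skin-color histogram, confidence weighting amounts to replacing integer counts with fractional ones. The sketch below assumes nonnegative confidence values and scales them by their maximum so that they fall in [0, 1]; that particular normalization and the function name `weighted_predictive` are assumptions, not details given in the text.

```python
def weighted_predictive(bins, weights, r, alpha_k=1.0):
    """Equation 9 with each observation counted as a fractional sample.

    bins[i] is the histogram bin of observation i and weights[i] its
    (nonnegative) tracker confidence.
    """
    top = max(weights)
    counts = {}
    for k, w in zip(bins, weights):
        counts[k] = counts.get(k, 0.0) + w / top   # weights scaled into [0, 1]
    n_eff = sum(counts.values())                   # effective sample size
    alpha = alpha_k * r
    return [(alpha_k + counts.get(k, 0.0)) / (alpha + n_eff) for k in range(r)]


# Hypothetical example: three observations over a 4-bin histogram.
probs = weighted_predictive(bins=[0, 0, 1], weights=[2.0, 1.0, 0.5], r=4)
```

Low-confidence observations still contribute, but less than a full sample, so frames in which tracking was unreliable have a smaller effect on the learned pdf.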

Performance improves for both cases. Results for the skin-color model are
shown in Figure 7. Note how the pixel distribution is most concentrated in
skin-colored regions in (c) because samples which were taken when tracking was
unreliable were suppressed. This is in contrast to (b), where each sample was
weighted evenly.



Reducing Bias from Tracking and Acquisition Functions: The problem we are
least able to overcome is bias in the initial tracking function and the acquisition
function, since they provide the supervisory data, $(p^0(\mathbf{x}), Z(\mathbf{x}))$. In the abstract,
there is very little we can do to eliminate such a bias. But, there may be
domain-specific solutions which help alleviate the problem.

Fig. 7. (a) A raw image, as input to the final tracking function. (b-c) Likelihood of
pixel skin-color, based on learned final model (likelihoods scaled so that highest is
darkest), with total weight of training data size kept constant: (b) 1 sample from each
frame; (c) data weighted by ellipse residual (1 sample from each frame, for 70% of
frames). (d) Bayesian decision to separate foreground from background: black pixels
mark foreground pixels; gray pixels mark potential foreground pixels which are more
likely to be background.

For example, the skin-color model learns a pdf consisting of mostly skin
color, together with a small contribution from pixels taken inadvertently from
the background. If we can learn the distribution of background pixels, we can
eliminate these with a Bayesian decision criterion to determine whether a given
pixel value, $\mathbf{u}^k$, is more likely to be skin or background. That is, $\mathbf{u}^k$ is skin if

$$p(\mathrm{skin} \mid \mathbf{u}^k) > p(\mathrm{bg} \mid \mathbf{u}^k), \qquad (11)$$

or, equivalently by Bayes' Rule,

$$p(\mathbf{u}^k \mid \mathrm{skin})\, p(\mathrm{skin}) > p(\mathbf{u}^k \mid \mathrm{bg})\, p(\mathrm{bg}), \qquad (12)$$

where $p(\mathbf{u}^k \mid \mathrm{skin})$ is acquired from Equation 9, $p(\mathbf{u}^k \mid \mathrm{bg})$ can be acquired similarly
by simply sampling pixels outside of the tracked ellipse (in practice, we collect
entire frames of pixels from just a few frames), and $p(\mathrm{skin})$ and $p(\mathrm{bg})$ are set
based on the relative area that they are expected to occupy in an image
containing a face. See Figure 7(d) for an example in which only those pixels which
occur frequently on the face, but very infrequently in the background, are
considered skin color. Modeling both skin and background as mixtures of a handful
of Gaussians achieves a similar result but without the granularity possible
with a nonparametric model.
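
The decision rule of Equation 12 reduces to a per-bin comparison. The following sketch uses toy 3-bin pdfs and a prior of 0.2 for skin; these numbers and the function name `is_skin` are purely illustrative assumptions.

```python
def is_skin(k, p_u_given_skin, p_u_given_bg, prior_skin):
    """Equation 12: bin k is skin if p(u|skin) p(skin) > p(u|bg) p(bg)."""
    prior_bg = 1.0 - prior_skin
    return p_u_given_skin[k] * prior_skin > p_u_given_bg[k] * prior_bg


# Toy learned pdfs over three color bins (hypothetical values).
p_skin = [0.7, 0.2, 0.1]
p_bg = [0.1, 0.2, 0.7]
labels = [is_skin(k, p_skin, p_bg, prior_skin=0.2) for k in range(3)]
```

Bin 1 is equally likely under both pdfs, so the prior decides it in favor of the background, which is expected to cover more of the image.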

In the head orientation example, our original generic model exhibits a slight
orientational bias - in Figure 3(c), a slight turn of the head to the left is visible.
We can eliminate this bias by finding the angle at which the model appears most
symmetrical and averaging the model with its reflection. Doing so does in fact
reduce some of the error generated by the final model (see Figure 4, Row 6 vs.
any of Rows 3-5).



Repeated Bootstrapping Finally, we mention the possibility of repeated
bootstrapping. Clearly, if one model can be used to learn a second model, any
combination of the first two models could be used to learn a third model. In
Figure 1, $f^1$ and $p^1(\mathbf{x})$ replace $f^0$ and $p^0(\mathbf{x})$, and bootstrap initialization iterates.

Strangely, in both of our examples, repeated bootstrapping does not appear
to improve the final models. For learning skin-color, repeated bootstrapping is
good for adapting to a changing color model [12], but for a fixed distribution,
there is nothing more to be gained by going beyond the asymptotically learned
color models. This is not surprising, since we have chosen to gather enough data
in the first iteration to learn a good model.
Figures 4 and 5(a) show that even for the head-texture case, bootstrapping
beyond the first iteration does not appear to improve pose estimates.

Figure 5(b) shows a one-dimensional schematic of what we believe is taking
place. The x-axis shows the angular position on the model and the y-axis gives
the extracted feature value. The top figure shows a generic model, the middle
figure shows the ground truth model, and the bottom graph shows the
bootstrapped model.

Pose estimation is equivalent to being given a short, noisy segment of the
bottom function (the observation) and trying to find the best match
displacement in the top function. In the case when we are trying to find a match to a
uniquely-varying part of the model, such as Segment A’, the corresponding
segment is accurately localized (Segment A). This corresponds to cases when the
front or side views (angular ranges 0-135) are being presented. Bootstrapping
helps in this instance because the characteristics of the real model are transferred
to the bootstrapped model.

When the observation segment is more homogeneous, as in Segment B’, the
match is likely to be affected by noise and other small differences in the generic
model, making matching inaccurate (Segment B). The bootstrapped model then
merely recaptures the initial estimation error. This behavior was observed in
many of the test images, where the bootstrapped model inherited a tendency to
misestimate back-of-head poses by a significant and consistent amount.

It is not clear whether the ineffectiveness of repeated bootstrapping should
be expected in other similar cases of bootstrap initialization. One distant
counterexample is the remarkable success of iterated bootstrapping in learning linear
subspace models of faces.

6 Conclusion 

We have presented bootstrap initialization, an abstract framework for using one 
model to guide the initialization of another model during tracking. We presented 
two examples of bootstrap initialization for tracking, using nonparametric mo- 
dels of target surface texture. In the first, we acquired a strong skin-color model 
of a user’s face, given a weak edge-based model of user shape. In the second, 
we refined a generic model of head texture to suit a particular individual. Preli- 
minary experiments show the potential for bootstrap initialization in tracking 
applications; in both cases, initial implementations were able to learn a boot- 
strapped model without undue concern for data dependencies in the acquired 
training data. 

Additional experiments provided evidence toward the following tentative con- 
clusions: 

— Independence assumptions in the acquired training data can be violated to a 
great degree. Movement of target objects creates enough variation that the 
sufficient statistics of training data taken over an extended period of time 
closely match those of an ideally sampled data set. 



Bootstrap Initialization of Nonparametric Texture Models for Tracking 133 



— Dependencies in data can be removed through both generic and domain-
specific strategies. Final models are better learned by taking advantage of 
such tactics. 

— Repeated bootstrapping does not necessarily result in improved models. 

In future work, we expect to delve deeper into theoretical limits of bootstrap 
initialization and more broadly into other tracking domains. 
Abstract. The problem of tracking pedestrians from a moving car is
a challenging one. The Condensation tracking algorithm is appealing
for its generality and potential for real-time implementation. However,
the conventional Condensation tracker is known to have difficulty with
high-dimensional state spaces and unknown motion models. This paper
presents an improved algorithm that addresses these problems by using a
simplified motion model, and employing quasi-Monte Carlo techniques to
efficiently sample the resulting tracking problem in the high-dimensional
state space. For $N$ sample points, these techniques achieve sampling
errors of $O(N^{-1})$, as opposed to $O(N^{-1/2})$ for conventional Monte Carlo
techniques. We illustrate the algorithm by tracking objects in both
synthetic and real sequences, and show that it achieves reliable tracking and
significant speed-ups over conventional Monte Carlo techniques.



1 Introduction 

Since its introduction, the Condensation algorithm [1] has attracted much inter- 
est as it offers a framework for dynamic state estimation where the underlying 
probability density functions (pdfs) need not be Gaussian. The algorithm is 
based on a Monte Carlo or sampling approach, where the pdf is represented by 
a set of random samples. As new information becomes available, the posterior 
distribution of the state variables is updated by recursively propagating these 
samples (using a motion model as a predictor) and resampling. An accurate 
dynamical model is essential for robust tracking and for achieving real-time per- 
formance. This is due to the fact that the process noise of the model has to be 
made artificially high in order to track objects that deviate significantly from 
the learned dynamics, thereby increasing the extent of each predicted cluster in 
state space. One would then have to increase the sample size to populate these 
large clusters with enough samples. A high-dimensional state space (required for 
tracking complex shapes such as pedestrians) only makes matters worse. Isard 
et al. [2] use two separate trackers, one in the Euclidean similarity space and the 
other in a separate deformation space, to handle the curse of dimensionality. 

We needed an algorithm for tracking moving objects (such
as pedestrians) from a moving camera for applications in driver assistance
systems and vehicle guidance that could contribute towards traffic safety [4, 3]. The
problem of pedestrian detection has been addressed in [3] and [5], but without
temporal integration of results. We believe that temporal integration of results
is essential for the demanding performance rates that might be required for the
actual deployment of such a system. This tracking problem, however, is
complicated because there is significant camera motion, and objects in the image move
according to unpredictable/unknown motion models. We want to make no
assumptions about how the camera is moving (translation, rotation, etc.) or about
the viewing angle. Hence it is not practically feasible to break up the dynamics
into several different motion classes ([6, 7]) and learn the dynamics of each class
and the class transition probabilities. We need a general model that is able to
cope with the wide variety of motions exhibited by both the camera and the
object, as well as with the shape variability of the object being tracked.

A common problem that is often overlooked when using the Condensation
tracker in higher dimensions is that typical implementations rely on the
system-supplied rand() function, which is almost always a linear congruential generator.
These generators, although very fast, have an inherent weakness that they are
not free of sequential correlation on successive calls, i.e. if $k$ random numbers
at a time are used to generate points in $k$-dimensional space, the points will
lie on $(k - 1)$-dimensional planes and will not fill up the $k$-dimensional space.
Thus the sampling will be sub-optimal and even inaccurate. Another problem
with these generators arises when the modulus operator is used to generate a
random sequence that lies in a certain range. Since the least significant bits of
the numbers generated are much less random than their most significant bits,
a less than random sequence results [8]. Even if one uses a ‘perfect’
pseudo-random number generator, the sampling error for $N$ points will only decrease as
$O(N^{-1/2})$, as opposed to $O(N^{-1})$ for another class of generators (see Section 2).
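
The weakness of the low-order bits is easy to reproduce. The sketch below implements a generic linear congruential generator with a power-of-two modulus; the multiplier and increment are commonly quoted textbook constants chosen here for illustration, not the parameters of any particular system's rand().

```python
def lcg(seed, a=1103515245, c=12345, m=2**31):
    """Yield successive states of the generator x <- (a*x + c) mod m."""
    state = seed
    while True:
        state = (a * state + c) % m
        yield state


gen = lcg(seed=1)
draws = [next(gen) for _ in range(16)]
low_bits = [x % 2 for x in draws]      # "coin flips" taken with the modulus operator
high_bits = [x >> 30 for x in draws]   # top bit of each 31-bit state
```

Because the modulus is a power of two and both the multiplier and the increment are odd, the lowest bit simply alternates between 0 and 1, so `low_bits` is perfectly predictable. Deriving values from the high-order bits, as in `high_bits`, avoids this particular defect, although it does not remove the lattice structure described above.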

We must thus deal with the problems of high dimensionality, motion
models of unknown form, and sub-optimal random number generators, while at the
same time attempting to achieve satisfactory performance. For accuracy, the
sampling must be fine enough to capture the variations in the state space, while
for efficiency, the sampling must be performed at a relatively small number of
points. In mathematical terms, the goal is to reduce the variance of the Monte
Carlo estimate.

Various techniques (such as importance sampling [2] and stratified sampling 
[9]) have been proposed to improve the efficiency of the representation. In im- 
portance sampling, auxiliary knowledge is used to sample more densely those 
areas of the state space that have more information about the posterior proba- 
bility. Importance sampling depends on already having some approximation to 
the posterior (possibly from alternate sensors), and is effective only to the extent 
that this approximation is a good one. In stratified sampling, variance reduction 
is achieved by dividing the state space into sub-regions and filling them with 
unequal numbers of points proportional to the variances in those subregions. 
However, this is not practical in spaces of high dimensionality since dividing a
$d$-dimensional space into $K$ segments along each dimension yields $K^d$ subregions,
too large a number when one has to estimate the variances in each of these subregions.
A promising extension to Condensation that addresses all the issues discussed
above is the incorporation of quasi-Monte Carlo methods [8, 10]. In such methods,
the sampling is not done with random points, but rather with a carefully
chosen set of quasi-random points that span the sample space so that the points 
are maximally far away from each other. These points improve the asymptotic 
complexity of the search (number of points required to achieve a certain sampling 
error), can be efficiently generated, and are well spread in multiple dimensions. 
Our results indicate that significant improvements due to these properties are 
achieved in our implementation of a novel Condensation algorithm using quasi- 
Monte Carlo methods. Note that quasi-random sampling is complementary to 
other sampling techniques used in conjunction with the Condensation algorithm, 
such as importance sampling, partitioned sampling [11], partial importance sam- 
pling [7], etc., and can readily be combined with these for better performance. 
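
To give a feel for how quasi-random points are generated, the sketch below builds a two-dimensional Halton sequence from the van der Corput radical inverse. The Halton sequence is used here only because it fits in a few lines; it is a different low-discrepancy construction from the Sobol' sequence used in this paper, and the function names are illustrative.

```python
def van_der_corput(i, base):
    """Radical inverse of the integer i in the given base, a value in [0, 1)."""
    value, denom = 0.0, 1.0
    while i > 0:
        i, digit = divmod(i, base)
        denom *= base
        value += digit / denom
    return value


def halton(n, bases=(2, 3)):
    """First n points of the Halton sequence, one prime base per dimension."""
    return [tuple(van_der_corput(i, b) for b in bases) for i in range(1, n + 1)]


points = halton(512)

# Count how many points fall in each of 8 equal-width vertical strips.
strip_counts = [0] * 8
for x, _ in points:
    strip_counts[int(x * 8)] += 1
```

For these 512 points every strip receives the same number of points, which is the kind of even coverage that pseudo-random points achieve only on average.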

This paper is organized as follows: Section 2 gives a brief introduction to 
quasi-random sequences and their properties, including a basic estimate of sam- 
pling error for quasi-Monte Carlo methods, and establishes their relevance to 
the Condensation algorithm. We also indicate how such sequences can be gen- 
erated and used in practice. In Section 3 we describe a modified Condensation 
algorithm that addresses the issues of an unknown motion model, robustness 
to outliers, and use of quasi-random points for efficiency. In Section 4 we apply 
this algorithm to some test problems and real video sequences, and compare its 
performance with an algorithm based on pseudo-random sampling. The results 
demonstrate the lower error rate and robustness of our algorithm for the same 
number of sampling points. Section 5 concludes the paper. 

2 Quasi-Random Distributions 

2.1 Sampling and Uniformity 

Functionals associated with problems in computer vision often have a complex 
structure in the parameter space, with multiple local extrema. Furthermore, 
these extrema can lie in regions of involved or convoluted shape in the parameter 
space. Alternatively, the functionals may have a collapsed structure and have 
support on a sub-dimensional manifold in the space (perhaps indicating an error 
in modeling or in the choice of parameters). If the sampling is to be successful in
recovering the functional in such cases, the distributions of the sample points and 
their subdimensional projections must satisfy certain properties. Intuitively, the 
points must be distributed such that any subvolume in the space should contain 
points in proportion to its volume (or other appropriate measure). This property 
must also hold for projections onto a manifold. 

Quasi-random sequences are a deterministic alternative to random sequences 
for use in Monte Carlo methods, such as integration and particle simulations of 
transport processes. The discrepancy of a set of points in a region is related to
the notion of uniformity. Let a region with unit volume have N points distributed
in it. Then, for uniform point distributions, any subregion with volume a would
have aN points in it. The difference between this quantity and the actual number
of points in the region is called the “discrepancy.” Quasi-random sequences have
low discrepancies and are also called low-discrepancy sequences. The error in
uniformity for a sequence of N points in the k-dimensional unit cube is measured
by its discrepancy, which is $O((\log N)^k N^{-1})$ for a quasi-random sequence, as
opposed to $O((\log\log N)^{1/2} N^{-1/2})$ for a pseudo-random sequence [15].

Fig. 1. Distributions of pseudo-random points (a) and quasi-random points overlaid
(b). (a) 512 points in $(0, 1)^2$ generated with Matlab’s pseudo-random number
generator. (b) 512 points from the Sobol’ sequence, denoted by (+), overlaid on top
of (a). Observe the clustering of the pseudo-random points in some regions and the
gaps left by them in others. The quasi-random points in (b) leave no such spaces and
do not form clusters.

Figure 1 compares the uniformity of distributions of quasi-random points 
and pseudo-random points. Figure 1(a) shows a set of random points generated 
in $(0, 1)^2$ using a pseudo-random number generator. If the distribution of points
were uniform one would expect that any region of area larger than 1/512 would 
have at least one point in it. As can be seen, however, many regions considerably 
larger than this are not sampled at all, while points in other portions of the 
region form rather dense clusters, thus oversampling those regions. Thus from 
an information gathering perspective the sampling is sub-optimal. Figure 1(b) 
shows Sobol’ quasi-random points overlaid on the pseudo-random points. These 
points do not clump together, and fill the spaces left by the pseudo-random 
points. 
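
The gap-filling behaviour described above can be checked numerically. The
following is a minimal sketch, not the experiment behind Figure 1: it uses a
Halton sequence (bases 2 and 3) in place of the Sobol’ sequence, Python's
built-in generator in place of Matlab's, and a 16 x 27 grid chosen (as an
assumption of this sketch) to match the two bases, so that any 432 consecutive
Halton points fall one per cell. The function names, seed and `skip` value are
illustrative only.

```python
import random

def radical_inverse(index, base):
    """Reflect the base-`base` digits of `index` about the radix point."""
    result, fraction = 0.0, 1.0 / base
    while index > 0:
        index, digit = divmod(index, base)
        result += digit * fraction
        fraction /= base
    return result

def halton_2d(count, skip=0):
    """2-D Halton points (bases 2 and 3) for indices skip .. skip+count-1."""
    return [(radical_inverse(i, 2), radical_inverse(i, 3))
            for i in range(skip, skip + count)]

def empty_cells(points, cols, rows):
    """Count cells of a cols x rows grid on the unit square with no point."""
    occupied = {(int(x * cols), int(y * rows)) for x, y in points}
    return cols * rows - len(occupied)

COLS, ROWS = 16, 27        # 432 cells of equal area (assumed grid)
N = COLS * ROWS            # one point per cell on average

rng = random.Random(0)     # fixed seed, purely for repeatability
pseudo = [(rng.random(), rng.random()) for _ in range(N)]
# Skipping the first 27 indices keeps every point strictly inside its cell.
quasi = halton_2d(N, skip=27)

pseudo_empty = empty_cells(pseudo, COLS, ROWS)
quasi_empty = empty_cells(quasi, COLS, ROWS)
print("empty cells, pseudo-random:", pseudo_empty)
print("empty cells, Halton:", quasi_empty)
```

With one point per cell on average, a pseudo-random set is expected to leave
roughly a third of the cells empty, whereas this block of Halton points leaves
none. The grid is deliberately favourable to the Halton construction, so the
comparison illustrates the effect rather than measuring discrepancy.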

A good introduction to why quasi-random distributions are useful in Monte 
Carlo integration is provided by Press et al. [8]. As far as application of the 
technique to optimization or sampling is concerned, Niederreiter [10] provides 
a mathematical treatment of this issue. The goal is to sample the space of pa- 
rameters with sufficient fineness so that we are close enough to every significant 
maximum or minimum, and can be assured that the approximation to the func- 
tional in any given region is bounded, and is well characterized by the number 
of points chosen for sampling, N. We will motivate and state below the results
for the quasi-random points. For a more mathematical and formal treatment,
consult [10, 17].

Given N points, for the sampling to be effective, each point should be op- 
timally far from the others in the set so that a truly representative picture 
of the function being sampled is arrived at. Intuitively, if the points are suffi- 
ciently close, the approximation to the underlying functional at any point will be 
bounded. This can be made precise by the multidimensional analogue of Rolle’s
theorem. The value of a function at some point $x_2$ can be approximated by its
value at a neighboring point $x_1$ according to

$$f(x_1 + \delta) = f(x_1) + \nabla f|_{\xi} \cdot \delta, \quad \text{for some } \xi \text{ such that } |\xi| < |\delta|, \qquad (1)$$

where $x_2 = x_1 + \delta$.

Thus for sufficiently smooth functions, our sampling of the function will
be subject to errors on the order of $\delta$, where $\delta$ is characterized by the
inter-sample point distance. The mathematical quantity “dispersion” was introduced
by Niederreiter [10] to account for this property of a set of sample points. Given 
a set of points, the dispersion is defined by the following construction: place balls 
at each of the sample points with radii sufficiently large to intersect the balls 
placed at the other points, so that the whole space is covered. We can now de- 
fine the average dispersion as the average radius of these balls, and the maximal 
dispersion by the maximum radius. The sampling error is thus characterized by 
the value of the dispersion of the set of sample points. 
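
The maximal dispersion defined by this construction can be estimated
numerically as the largest distance from any location in the space to its
nearest sample point. The sketch below is an illustration under stated
assumptions: it probes the unit square on a finite grid (so it can only
underestimate the true supremum), uses a Halton sequence as the quasi-random
set, and all names and parameter values are hypothetical.

```python
import math
import random

def radical_inverse(index, base):
    """Reflect the base-`base` digits of `index` about the radix point."""
    result, fraction = 0.0, 1.0 / base
    while index > 0:
        index, digit = divmod(index, base)
        result += digit * fraction
        fraction /= base
    return result

def maximal_dispersion(points, probes_per_axis=32):
    """Largest distance from a probe location to its nearest sample point."""
    worst = 0.0
    for i in range(probes_per_axis):
        for j in range(probes_per_axis):
            px = (i + 0.5) / probes_per_axis
            py = (j + 0.5) / probes_per_axis
            nearest = min(math.hypot(px - x, py - y) for x, y in points)
            worst = max(worst, nearest)
    return worst

N = 16 * 27                 # 432 points
rng = random.Random(0)      # fixed seed, purely for repeatability
pseudo = [(rng.random(), rng.random()) for _ in range(N)]
quasi = [(radical_inverse(i, 2), radical_inverse(i, 3))
         for i in range(27, 27 + N)]

pseudo_dispersion = maximal_dispersion(pseudo)
quasi_dispersion = maximal_dispersion(quasi)
# 432 consecutive Halton indices put one point in every 1/16 x 1/27 cell,
# so the Halton dispersion cannot exceed the diagonal of such a cell.
cell_diagonal = math.hypot(1.0 / 16, 1.0 / 27)
print(pseudo_dispersion, quasi_dispersion, cell_diagonal)
```

The bound on the Halton set follows from its construction; no comparable
guarantee exists for the pseudo-random set, whose largest gap varies from seed
to seed.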

As shown in [10], low-discrepancy distributions of points have low dispersions, 
and hence provide lower sampling errors (see Equation (1)) in comparison with 
point sets with higher discrepancies. 

2.2 Generating Quasi-Random Distributions

Now that we have seen that quasi-random distributions are likely to be useful 
for numerical problems requiring random sampling, the question is whether such 
distributions exist, and how one constructs them. Several distributions of quasi- 
random points have been proposed. These include the Halton, Faure, Sobol’, and 
Niederreiter family of sequences. Several of these have been compared as to their 
discrepancy and their suitability for high-dimensional Monte Carlo calculations 
[14, 15]. The consensus appears to be that the Sobol’ sequence is good for
problems of moderate dimension (k < 7), while the Niederreiter family of sequences
seems to do well in problems of somewhat higher dimension. For problems in
very large numbers of dimensions (k > 100), the properties of these distributions,
and strategies for reducing their discrepancies to theoretical levels, are active
areas of research [17].

The Sobol’ and the Niederreiter sequences of order 2, which can be generated
using bit shifting operations, are the most efficient. For reasons of brevity, their
generation algorithms are not discussed here; readers are referred to [13,
16]. The complexity of these quasi-random generators is comparable to that of
standard pseudo-random number generation schemes, and there is usually no
performance penalty for using them.
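
To give a flavour of generation by bit shifting, the sketch below computes the
base-2 van der Corput sequence by reversing the bits of the point index. This
is the simplest base-2 low-discrepancy sequence and is shown only as an
illustration; it is not the Sobol’ or Niederreiter generator of [13, 16], and
the function name and 32-bit word size are assumptions of this sketch.

```python
def van_der_corput_base2(index, bits=32):
    """Base-2 radical inverse of `index`, computed by reversing its bits."""
    reversed_bits = 0
    for _ in range(bits):
        # Shift the lowest bit of `index` into the reversed word.
        reversed_bits = (reversed_bits << 1) | (index & 1)
        index >>= 1
    return reversed_bits / float(1 << bits)

points = [van_der_corput_base2(i) for i in range(1, 9)]
print(points)
```

Each new point falls in the largest gap left by the earlier ones (1/2, then
1/4 and 3/4, then the odd eighths, and so on), and the cost per point is a
fixed number of shifts, comparable to a pseudo-random draw.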

3 The Modified Tracking Algorithm

In the standard formulation of the Condensation algorithm [1], the sample
positions at time t are obtained from the previous approximation to the posterior
(a weighted sample set, the weights being the probabilities), using the motion
model $p(X_t \mid X_{t-1})$ as a predictor. The dynamics is usually represented as a
second-order auto-regressive process, where each of the dimensions of the state space is modelled
by an independent one-dimensional oscillator. The parameters of the oscillators 
are typically learned from training sequences that are not too hard to track [19, 
20, 6, 7]. To learn multi-class dynamics, a discrete state component labelling the 
class of motion is appended to the continuous state vector Xt to form a “mixed” 
state, and the dynamical parameters of each class and the state transition prob- 
abilities are learned from example trajectories. However, for the complicated 
motions exhibited by pedestrians walking in front of a moving car, it is not 
easy to identify different classes of motions that make up the actual motion. 
Moreover, we would like to make no assumptions about how the camera is mov- 
ing (translation, rotation, etc.) or about the viewing angle. We need a general 
model that is able to cope with the wide variety of motions exhibited by both 
the camera and the object being tracked, as well as the shape variability of the 
object. We propose using a zero-order motion model with process noise
large enough to account for the greatest expected change in shape and motion,
since we now have a method of efficiently sampling high-dimensional spaces using
quasi-random sequences.

Given the weighted sample set at the previous time step, we first choose
a base sample with probability given by its weight. This yields a small number of
highly probable locations, say M, the neighborhoods of which we must sample more
densely. This has the effect of reducing $\delta$ when the Jacobian term in Equation (1)
is locally large, thereby achieving a more consistent distribution of error over the
domain (importance sampling). If there were just one region requiring a dense
concentration, an invertible mapping from a uniform space to the space of equal
importance could be constructed, as given below in Equation (3) for the case of a
multi-dimensional Gaussian. Since we have M regions, the importance function
cannot be constructed in closed form. One therefore needs an alternative strategy
for generating, from the quasi-random distribution, a set of points that samples
important regions densely.

We have devised a simple yet effective strategy that achieves these objectives.
Let the M locations have centers $\mu_j$ and variances $\sigma_j^2$ (j = 1, ..., M) based on
the process noise, where these quantities are k-dimensional vectors. We then overlay
M + 1 distributions of quasi-random points over the space, with the first M
distributions made Gaussian, centered at $\mu_j$ and with diagonal variance $\sigma_j^2$
(Equation (3)). Finally, we also overlay an (M + 1)th distribution that is spread
uniformly over the entire state space. This provides robustness against sudden
changes in shape and motion. The total number of points used is N, where

$$N = N_1 + N_2 + \cdots + N_{M+1}, \qquad (2)$$

the sample size in the Condensation algorithm. We have in effect chosen the new
sample positions by sampling from the motion model $p(X_t \mid X_{t-1})$ conditioned on
the selected base samples.
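
One way to arrive at the counts in Equation (2) is sketched below. It is an
assumption of this sketch, not a statement of the authors' procedure, that the
per-location counts arise from drawing base samples with replacement in
proportion to their weights, and that a fixed fraction of N is reserved for the
uniform distribution (10% and N = 2000 are the values quoted in Section 4.2).
The weights, seed and function name are hypothetical.

```python
import random
from collections import Counter

def allocate_counts(weights, n_total, uniform_fraction=0.1, seed=0):
    """Split n_total points into per-location counts plus a uniform share.

    Returns (counts, n_uniform): counts maps base-sample index -> N_j,
    and n_uniform plays the role of N_{M+1} in Equation (2).
    """
    n_uniform = int(round(uniform_fraction * n_total))
    n_local = n_total - n_uniform
    rng = random.Random(seed)
    # Draw base samples with replacement, with probability proportional to
    # their weights; repeated draws of one index give its count N_j.
    picks = rng.choices(range(len(weights)), weights=weights, k=n_local)
    return Counter(picks), n_uniform

weights = [0.02, 0.40, 0.0, 0.03, 0.55]   # hypothetical posterior weights
counts, n_uniform = allocate_counts(weights, n_total=2000)
print(dict(counts), n_uniform)
```

Only base samples that are actually drawn receive a Gaussian cloud, so the
number of distinct keys in `counts` corresponds to M, and the counts together
with `n_uniform` sum to N.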

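The conversion from a uniform quasi-random distribution to a Gaussian
quasi-random distribution is achieved using the mapping along the ith dimension

$$y_{ji} = \mu_{ji} + \sqrt{2}\,\sigma_{ji}\,\mathrm{erf}^{-1}\left(2\xi_{ji} - 1\right), \qquad (3)$$

where $\mu_{ji}$ and $\sigma_{ji}$ are the ith components of the center and standard deviation
of the jth Gaussian, $\mathrm{erf}^{-1}$ is the inverse of the error function given by

$$\mathrm{erf}(z) = \frac{2}{\sqrt{\pi}} \int_0^z e^{-t^2}\,dt,$$

and $\xi_{ji}$ represents the quasi-randomly distributed points in [0, 1].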
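
The mapping of Equation (3) can be sketched in one dimension as follows. The
Python standard library has no inverse error function, so this sketch relies
on `statistics.NormalDist.inv_cdf`, which is mathematically the same map,
mean + sqrt(2) * std * erfinv(2u - 1). The base-2 radical inverse stands in for
the quasi-random generator, and the center and standard deviation are
hypothetical values.

```python
import math
from statistics import NormalDist

def uniform_to_gaussian(u, mean, std):
    """Map u in (0, 1) to a Gaussian variate via the inverse normal CDF.

    Equivalent to mean + sqrt(2) * std * erfinv(2u - 1).
    """
    return NormalDist(mean, std).inv_cdf(u)

def radical_inverse(index, base):
    """Reflect the base-`base` digits of `index` about the radix point."""
    result, fraction = 0.0, 1.0 / base
    while index > 0:
        index, digit = divmod(index, base)
        result += digit * fraction
        fraction /= base
    return result

mean, std = 3.0, 2.0        # hypothetical center and standard deviation
# Start at index 1: the point u = 0 would map to minus infinity.
uniform_points = [radical_inverse(i, 2) for i in range(1, 256)]
gaussian_points = [uniform_to_gaussian(u, mean, std) for u in uniform_points]
sample_mean = sum(gaussian_points) / len(gaussian_points)

# Round trip for one point: applying erf recovers 2u - 1.
y = uniform_to_gaussian(0.75, mean, std)
round_trip = math.erf((y - mean) / (std * math.sqrt(2.0)))
print(sample_mean, round_trip)
```

Because the map is monotonic, the even spacing of the quasi-random points in
[0, 1] carries over to an even coverage of the Gaussian's probability mass.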

Finally, we measure and compute the probabilities $p(Z_t \mid X_t)$
for these new sample positions in terms of the image data $Z_t$. We use a
measurement density based on the multi-feature distance transform algorithm (see [3]
for details) that has been successfully used for detecting pedestrians from static
images. Therefore

$$\log p(Z_t \mid X_t) \equiv \log p(Z \mid X) \propto -\frac{1}{M} \sum_{i=1}^{M} d_{\mathrm{typed}}^2(z_i, I),$$

where the $z_i$’s are measurement points along the contour, I is the image data,
and $d_{\mathrm{typed}}(z_i, I)$ denotes the distance between $z_i$ and the closest feature of the
same type in I. We use oriented edges discretized into eight bins as the features
in all our experiments.

4 Results

In order to investigate the effectiveness of quasi-random sampling we performed
experiments using a simple synthetic example, as well as real video sequences of
pedestrians taken from moving cars. Both sets of experiments demonstrated the
expected improvements due to the use of quasi-random sampling. We describe
these below.

4.1 Synthetic Experiments 

We constructed the following simple tracking problem to illustrate the
effectiveness of using quasi-random sampling as opposed to pseudo-random sampling for
the Condensation tracker. The motion of an ellipse of fixed aspect ratio (ratio
of axes)

$$\left(\frac{x - x_c(t)}{a(t)}\right)^2 + \left(\frac{y - y_c(t)}{b(t)}\right)^2 = 1 \qquad (4)$$

was simulated using a second-order harmonic oscillator model (4) independently
in each of the ellipse parameters $x_c$, $y_c$ and a. The ellipse translates and scales as a
result of the combination of their motions. Reasonable values for the parameters
of the oscillators were chosen manually.

Fig. 2. Error distributions vs. frame number (frames 0 to 500). (a) Error
distribution for the x-coordinate of the center, $x_c$. (b) Error distribution for the
y-coordinate of the center, $y_c$. Light - PseudoRandom; Dark - QuasiRandom.
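
A synthetic trajectory of the kind described above can be sketched with an
undamped second-order recurrence per parameter. The paper does not report the
oscillator parameters, so every centre, amplitude, frequency and the aspect
ratio below is a hypothetical value chosen for illustration, and the recurrence
is one simple discretisation of a harmonic oscillator rather than the authors'
simulator.

```python
import math

def oscillator_track(center, amplitude, omega, n_frames):
    """Undamped second-order recurrence s[t+1] = 2 cos(omega) s[t] - s[t-1].

    Started from two consecutive cosine samples, it reproduces
    center + amplitude * cos(omega * t).
    """
    c = 2.0 * math.cos(omega)
    prev, curr = amplitude, amplitude * math.cos(omega)
    values = [center + prev, center + curr]
    for _ in range(n_frames - 2):
        prev, curr = curr, c * curr - prev
        values.append(center + curr)
    return values

N_FRAMES = 500
# Hypothetical parameters: (center, amplitude, angular frequency per frame).
x_c = oscillator_track(160.0, 60.0, 0.05, N_FRAMES)
y_c = oscillator_track(120.0, 40.0, 0.08, N_FRAMES)
a = oscillator_track(30.0, 10.0, 0.03, N_FRAMES)
ASPECT_RATIO = 0.5          # fixed ratio of the two axes (assumed value)
b = [ASPECT_RATIO * a_t for a_t in a]
print(x_c[:3], y_c[:3], a[:3])
```

Each state variable depends only on its own two previous values, which is what
makes the three parameters independent second-order processes.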

We ran the tracking algorithm described in Section 3, first with a standard
pseudo-random number generator and then with the quasi-random number
generator for a given value of N (the Condensation sample size). The tracking
algorithm generates estimates for the ellipse parameters at each time step,
namely $x_{cp}(t)$, $y_{cp}(t)$ and $a_p(t)$ in the pseudo-random case and $x_{cq}(t)$, $y_{cq}(t)$
and $a_q(t)$ in the quasi-random case, from which the errors in the estimates
$e_{x_c p}(t)$, $e_{y_c p}(t)$, $e_{a p}(t)$ (pseudo-random case) and $e_{x_c q}(t)$, $e_{y_c q}(t)$, $e_{a q}(t)$
(quasi-random case) are obtained. A consistent and reliable value of the error in each
dimension was obtained by performing M Monte Carlo trials with each type of
generator (for quasi-random, using successive points from a single quasi-random
sequence) for each N. All plots shown here are for a sequence of length 500
frames and for 50 trials. Figure 2 shows the errors in the estimates of the center
of the ellipse in all the 50 trials. The errors for both types of generators are
plotted on top of each other. One can clearly see that the standard pseudo-random
number generator leads to higher errors at almost every time step. To get a feel
for how the sample size of the tracker affects the error rates resulting from the
two sampling methods, the mean of the root mean square errors and the standard
deviation over the entire sequence are plotted against N on a log-log scale
(base 2).
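
The summary statistics plotted against N can be computed along the following
lines. The text leaves the exact definition open, so this sketch takes one
plausible reading: a root mean square error per trial over the whole sequence,
then the mean and standard deviation of those values across trials. The toy
data and function names are hypothetical.

```python
import math
from statistics import mean, pstdev

def rmse(estimates, truth):
    """Root mean square error of one trial over the whole sequence."""
    squared = [(e - t) ** 2 for e, t in zip(estimates, truth)]
    return math.sqrt(sum(squared) / len(squared))

def summarize_trials(trials, truth):
    """Mean and standard deviation of the per-trial RMSE values."""
    per_trial = [rmse(estimates, truth) for estimates in trials]
    return mean(per_trial), pstdev(per_trial)

truth = [0.0, 1.0, 2.0, 3.0]
trials = [
    [0.0, 1.0, 2.0, 3.0],   # perfect track: RMSE 0
    [2.0, 3.0, 4.0, 5.0],   # constant offset of 2: RMSE 2
]
avg_rmse, sd_rmse = summarize_trials(trials, truth)
print(avg_rmse, sd_rmse)
```

Repeating this for each sample size and each generator, and plotting the two
statistics against log2(N), gives curves of the kind shown in Figures 3 and 4.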

Figure 3 shows the plots of the average rmse and standard deviation errors
in the estimation of the center coordinates of the ellipse, $x_c$ and $y_c$. From these
experiments, as well as those described below, it can be seen that quasi-random
sequences generally result in lower errors than standard random sequences.
Furthermore, for low values of N, the errors for quasi-random sampling drop faster
as the number of samples is increased, but as N gets very large, a saturation
condition is reached, and a further increase in the sample size does not lead
to comparable drops in the error rates, although they are still lower than in
the pseudo-random case. These graphs thus show that for a given tolerance to
error, quasi-random sampling needs a significantly smaller number of sample
points (between 1/3 and 1/2 as many), thereby speeding up the execution of the
algorithm considerably.

Fig. 3. Log-log plot of estimation error vs. N (sample size) for high process noise;
the horizontal axis is log2(N). (a) Avg. RMSE in estimating $x_c$. (b) Avg. standard
deviation error in $x_c$. (c) Avg. RMSE in estimating $y_c$. (d) Avg. standard deviation
error in $y_c$. * - PseudoRandom, + - QuasiRandom.

Figure 4 shows similar plots for the low process noise case, where the effects of
using quasi-random sampling are slightly reduced compared to the high process
noise case. Finally, Figure 5 (not a log-log plot) shows the behavior of the error
rates with increasing process noise for a fixed value of N. As the process noise
increases, the superiority of quasi-random sampling becomes clearer and both
the rmse and sd errors for pseudo-random sampling increase much more rapidly
than their quasi-random counterparts.

Fig. 4. Log-log plot of estimation error vs. N (sample size) for low process noise.
(a) Avg. RMSE in estimating $x_c$. (b) Avg. standard deviation error in $x_c$. (c) Avg.
RMSE in estimating $y_c$. (d) Avg. standard deviation error in $y_c$. * - PseudoRandom,
+ - QuasiRandom.

We have thus seen that using quasi-random sampling as the underlying ran- 
dom sampling technique in particle filters can lead to a significant improvement 
in the performance of the tracker. Even in a simplistic 3-D state space case 
such as that presented in this section, there is a sizable difference in the error 
rates. Furthermore, quasi-random sampling is actually more powerful in higher 
dimensions, as will be qualitatively demonstrated in the following section. We 
also note that adding noise to the simulations only helps the quasi-random case, 
since there are more clusters corresponding to multiple hypotheses which need 
to be populated efficiently. 

Fig. 5. Estimation error vs. process noise (fixed N). (a) Avg. RMSE in $x_c$ vs.
process noise. (b) Avg. sd error in $x_c$ vs. process noise. (c) Avg. RMSE in $y_c$ vs.
process noise. (d) Avg. sd error in $y_c$ vs. process noise. * - PseudoRandom,
+ - QuasiRandom.

4.2 Tracking pedestrians from a moving vehicle 

We now present some results on tracking pedestrians from a moving vehicle using
the techniques discussed above. First, a statistical shape model of a pedestrian
was built using automatically segmented pedestrian contours from sequences
obtained by a stationary camera (so that we can do background subtraction). We
use well-established computer vision techniques (see [22] and [23]) to build an
LPDM (Linear Point Distribution Model). We fit a NURB (Non-Uniform Rational
B-spline) to each extracted contour using least squares curve approximation
to points on the contour [21]. The control points of the NURBs are then used as
a shape vector and aligned using weighted Procrustes analysis, where the control
points are weighted according to their consistency over the entire training
set. The dimensionality is then reduced by using Principal Component Analysis
(PCA) to find an eight-dimensional space of deformations. Hence, the total
dimension of $X_t$ (the state variable) is 12 (4 for the Euclidean similarity
parameters and 8 for the deformation parameters). We used N = 2000 samples and
the tracker was initialized in the first frame of the sequence using the pedestrian
detection algorithm described in [3]. We introduced 10% of random samples at
every iteration to account for sudden changes in shape and motion. We applied
the tracker to several Daimler-Chrysler pedestrian sequences and found that the
quasi-random tracker was able to successfully track the pedestrians over the
entire sequence. The tracker was also able to recover very quickly from failures
due to sudden changes in shape or motion or to partial occlusion. On the other
hand, the pseudo-random tracker was easily distracted by clutter and was unable
to recover from some failures. Figure 6 shows some frames where the
pseudo-random tracker drifts and fails. For the same sequences with the same sample
size, the quasi-random tracker was able to track successfully. Figures 7 and 8
show the tracker output for two pedestrian sequences using the quasi-random
tracker. In each frame, both the state estimate with the maximum probability
and the mean state estimate are shown.

Fig. 6. Tracking failures using standard pseudo-random sampling. Dark - Highest
probability state estimate; Light - Mean state estimate. The quasi-random tracker was
successful using the same number of samples.

5 Conclusions 

In this paper, we have addressed the problem of using the Condensation tracker
for high-dimensional problems by incorporating quasi-Monte Carlo methods into
the conventional algorithm. We have also addressed the problem of making the
tracker work efficiently in situations where the motion models are unknown. The
superiority of quasi-random sampling was demonstrated using both synthetic
and real data. Promising results on pedestrian tracking from a moving vehicle
were obtained using these techniques.

Monte Carlo techniques are used in other areas of computer vision where
there is a need for optimization or sampling. The use of quasi-random points
can be readily extended to these areas and should result in improved efficiency
or speed-up of algorithms.

Fig. 7. Tracking results for Daimler-Chrysler pedestrian sequence using quasi-random
sampling (frames 4, 9, 12, 17, 21 and 26). Dark - Highest probability state estimate;
Light - Mean state estimate.

Fig. 8. Tracking results for Daimler-Chrysler pedestrian sequence using quasi-random
sampling (frames 14, 19, 25, 33, 38 and 49). Dark - Highest probability state estimate;
Light - Mean state estimate.


Abstract. Robustly tracking people in visual scenes is an important
task for surveillance, human-computer interfaces and visually mediated 
interaction. Existing attempts at tracking a person’s head and hands deal 
with ambiguity, uncertainty and noise by intrinsically assuming a con- 
sistently continuous visual stream and/or exploiting depth information. 
We present a method for tracking the head and hands of a human subject 
from a single view with no constraints on the continuity of motion. Hence 
the tracker is appropriate for real-time applications in which the availa- 
bility of visual data is constrained, and motion is discontinuous. Rather 
than relying on spatio-temporal continuity and complex 3D models of 
the human body, a Bayesian Belief Network deduces the body part po- 
sitions by fusing colour, motion and coarse intensity measurements with 
contextual semantics. 



1 Introduction 



Tracking human body parts and motion is a challenging but essential task for 
modelling, recognition and interpretation of human behaviour. In particular, 
tracking of at least the head and hands is required for gesture recognition in 
human-computer interface applications such as sign-language recognition.
Existing methods for markerless tracking can be categorised according to the
measurements and models used. In terms of measurements, tracking usually relies
on intensity information such as edges, skin colour and/or motion
segmentation, or a combination of these with other cues including
depth. The choice of model depends on the application of the tracker.
If the tracker output is to be used for some recognition process then a 2D model
of the body will suffice. On the other hand, a 3D model of the body may
be required for generative purposes, to drive an avatar for example, in which case
skeletal constraints can be exploited, or deformable 3D models can be
matched to 2D images.

Colour-based tracking of body parts is a relatively robust and inexpensive ap- 
proach. Nevertheless the loss of information involved induces problems of noise, 
uncertainty, and ambiguity due to occlusion and distracting “skin-coloured” 
background objects. The two most difficult problems to deal with when tracking 
the head and hands are occlusion and correct hand association. Occlusion occurs
when a hand passes in front of the face or intersects with the other hand. Hand 
association requires that the hands found in the current frame be matched cor- 
rectly to the left and right hands. Most existing attempts at tracking cope with 
these problems using temporal prediction and/or depth information. Temporal 
prediction intrinsically assumes temporal order and continuity in measured data,
so a consistent, sufficiently high frame rate is required. The use of depth
information requires more than one camera and a solution to the correspondence
problem, which is computationally non-trivial.

We argue that robust, real-time human tracking systems must be designed 
to work with a source of discontinuous visual information. Any vision system 
operates under constraints that attenuate the bandwidth of visual input. In some 
cases the data may simply be unavailable, in other cases computation time is 
limited due to finite resources. A further and more significant computational 
constraint is associated with complexity and stability of behavioural models. 
Exhaustive modelling of the world would be prohibitively complex; rather it is 
more realistic to establish economical models or beliefs about the environment 
which are iteratively updated by visual observations. Since the models are not 
exhaustive, not all visual information requires processing. In fact, it may be un- 
desirable to absorb all available visual information into belief structures because 
instability, or “catastrophic unlearning”, may result. Therefore a robust vision
system should be based on selective attention to filter out irrelevant
information and use only salient visual stimuli to update its beliefs. While selective
attention is traditionally considered in the spatial domain, in this work we cast 
the notion into the temporal domain in order to relax the underlying constraint 
of temporal order and continuity required in tracking visual events over time. 

We achieve the goal of tracking discontinuous human body motion by repla- 
cing the problem of spatio-temporal prediction with reasoning about body-part 
associations based on contextual knowledge. Our approach uses Bayesian Belief 
Networks (BBNs) to fuse high-level contextual knowledge with sensor-level ob- 
servations. Belief networks are an effective vehicle for combining user-supplied 
semantics with conflicting and noisy observations to deduce an overall consistent 
interpretation of the scene. BBNs have been used previously as a framework
for tracking multiple vehicles under occlusion using contextual information.
A naive BBN has also been used to characterise and classify objects in a visual
scene. For tracking body parts under discontinuous motion the BBN framework
is ideal because, unlike other tracking methods such as Kalman filtering or
CONDENSATION that explicitly model the dynamics through change, belief
networks model absolute relationships between variables and can make deductive
leaps given limited but significant evidence. Nevertheless, the accumulated
beliefs still implicitly reflect all currently observed evidence over time. We
demonstrate that through iterative revision of hypotheses about associations of
hands with skin-coloured image regions, such an atemporal belief-based tracker is
able to recover from almost any form of track loss. In Section 2 we describe the
context, assumptions and measurements used by the body tracker. In Section
3 we present the framework for combining these observations with contextual
knowledge using BBNs. An experimental comparison of our tracker with a
dynamic tracker and a non-contextual tracker is presented in Section 4, and the
conclusion is given in Section 5.

2 Tracking Discontinuous Motion from 2D Observations 

The merits of any given behavioural modelling method are established according 
to the purpose for which it is used, therefore it is appropriate at this point to 
introduce the context for our tracking approach and the assumptions made. 
We are interested in modelling individual and group behaviours for visually 
mediated interaction using only a single 2D view, therefore depth information is 
unavailable. Behaviour models are used to interpret activities in the scene and 
change the view to focus on regions of interest. Therefore we have the luxury 
of not requiring full 3D tracking of the human body parts, which would rely 
on expensive matching to unreliable intensity observations. On the other hand, 
the system is required to simultaneously track several people which generally 
results in a variable and relatively low frame rate. From our experience with 
these conditions, a person’s hand, for example, can often move from rest to a 
distance half the length of their body between one frame and the next! Also, in 
images of manageable resolution containing several people (all images used in 
this work are 320 x 240 pixels), the hands may occupy regions as small as ten 
pixels or less wide, making appearance-based methods unreliable. 

To illustrate the nature of the discontinuous body motions under these
conditions, Figure 1 shows the head and hand positions and accelerations (as vectors)
for two video sequences, along with sample frames. The video frames were
sampled at 18 frames per second (fps). Even so, there are many significant temporal
changes in both the magnitude and orientation of the acceleration of the hands.
It may be unrealistic to attempt to model the dynamics of the body under these 
circumstances. We propose that under the following assumptions, the ambigui- 
ties and uncertainties associated with tracking a person’s discontinuous head 
and hand movement can be overcome using only information from a single 2D 
view without modelling the full dynamics of the human body: 

1. the subject is oriented roughly towards the camera for most of the time,

2. the subject is wearing long sleeves,

3. reasonably good colour segmentation of the head and hands is possible, and

4. the head and hands are the largest moving skin colour clusters in the image.

The robust visual cues used for tracking are now described, followed by a
description of the head-tracking and bootstrapping methods.

Fig. 1. Two examples of behaviour sequences and their tracked head and hand positions
and accelerations. At each time frame, the 2D acceleration is shown as an arrow with
arrowhead size proportional to the acceleration magnitude. From left to right, the plots
correspond to the head, left hand and right hand.

2.1 Computing Visual Cues

Real-time vision systems have two chief practical requirements: computational
efficiency and robustness. Computational constraints exclude the use of
expensive optimisation methods, while robustness requires tolerance of assumption
violation. To meet these requirements we adopt a philosophy of perceptual
fusion: independent, relatively inexpensive visual cues are combined to benefit from
their mutual strengths and achieve some invariance to their assumptions. The
cues that are used to drive our body tracker are skin colour, image motion and
coarse intensity information, namely hand orientation. Pixel-wise skin colour
probability has been previously shown to be a robust and inexpensive visual cue
for identification and tracking of people under varying lighting conditions.
Skin colour probabilities can be computed for an image and thresholded to obtain
a binary skin image; an example is shown in Figure 2(b). Here image motion
is naively computed as the thresholded difference between pixel intensities in
successive frames; an example is shown in Figure 2(c).
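
The two binary cue images described above amount to simple thresholding
operations. The sketch below illustrates them on tiny hand-made arrays; the
frame contents, skin probabilities and both threshold values are hypothetical,
and the skin colour model that would supply the probabilities is not shown.

```python
def threshold(image, level):
    """Binary image: 1 where the pixel value exceeds `level`, else 0."""
    return [[1 if value > level else 0 for value in row] for row in image]

def motion_image(previous, current, level):
    """Thresholded absolute difference between successive intensity frames."""
    diff = [[abs(c - p) for p, c in zip(prev_row, curr_row)]
            for prev_row, curr_row in zip(previous, current)]
    return threshold(diff, level)

# Tiny hypothetical 3x4 frames (intensities 0..255) and skin probabilities.
frame_prev = [[10, 10, 10, 10], [10, 200, 10, 10], [10, 10, 10, 10]]
frame_curr = [[10, 10, 10, 10], [10, 10, 200, 10], [10, 10, 10, 10]]
skin_prob = [[0.1, 0.1, 0.1, 0.9], [0.1, 0.2, 0.8, 0.1], [0.1, 0.1, 0.1, 0.1]]

binary_skin = threshold(skin_prob, 0.5)
binary_motion = motion_image(frame_prev, frame_curr, 30)
print(binary_skin)
print(binary_motion)
```

In this toy example the bright pixel moves one column to the right, so the
motion image marks both its old and its new location, while the skin image
marks the two pixels with high skin probability.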

Skin colour and motion are natural cues for focusing attention and processing 
resources on salient regions in the image. Note that although distracting noise 
and background clusters appear in the skin image, these can be eliminated at 
a low level by “AND”ing directly with motion information. However, fusion of 
these cues at this low level of processing is premature due to loss of information. 

Fig. 2. Example of visual cues measured from video stream: (a) original image; (b)
binary skin colour image; and (c) binary motion image.



The problem of associating the correct hands over time can usually be solved 
using spatial constraints. However, situations arise under occlusion in which 
choosing the nearest skin-coloured cluster to the previous hand position results 
in incorrect hand assignment. Therefore the problem cannot be solved purely 
using colour and motion information. In the absence of depth information or 3D 
skeletal constraints, we use intensity information to assist in resolving incorrect 
assignment. The intensity image of each hand is used to obtain a very coarse 
measurement of hand orientation which is robust even in low resolution imagery. 
The restricted kinematics of the human body are loosely modelled to exploit the 
fact that only certain hand orientations are likely at any position in the image 
relative to the head. 

The accumulation of a statistical hand orientation model is illustrated in
Figure 3. Assuming that the subject is facing the camera, the image is divided
coarsely into a grid of histogram bins. We then artificially synthesise a histo- 
gram of likely hand orientations for each 2D position of the hand in the image 
projection relative to the head position. To do this, a 3D model of the human 
body is used to exhaustively sample the range of possible arm joint angles in 
upright posture. Assuming that the hand extends parallel to the forearm, the 2D 
projection is made to obtain the appearance of hand orientation and position 
in the image plane, and the corresponding histogram bin is updated. During 
tracking, the quantised hand orientation is obtained according to the maximum 
response from a bank of oriented Gabor filters, and the tracked hand position 
relative to the tracked head position is used to index the histogram and obtain 
the likelihood of the hand orientation given the position. 
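
The histogram accumulation and lookup can be sketched as follows. This is a
deliberately simplified stand-in, not the body model used in the paper: the
upper arm and forearm directions are sampled independently with no joint
limits, the projection is orthographic, and every constant (segment lengths,
shoulder offset, bin sizes) is a hypothetical value chosen for illustration.
The Gabor filter bank that measures orientation during tracking is not shown.

```python
import math
from collections import defaultdict

# Every constant below is an illustrative assumption, not a value from the paper.
UPPER_ARM, FOREARM = 30.0, 28.0   # arm segment lengths in pixels
SHOULDER = (20.0, 25.0)           # shoulder offset (x, y) from the head centre
CELL = 20.0                       # side of one position bin in pixels
N_ORIENT = 4                      # number of quantised hand orientations


def direction(azimuth, elevation):
    """Unit 3D vector (x, y, z) for a pair of joint angles in radians."""
    return (math.cos(elevation) * math.cos(azimuth),
            math.sin(elevation),
            math.cos(elevation) * math.sin(azimuth))


def position_bin(x, y):
    """Grid cell of a hand position given relative to the head."""
    return (math.floor(x / CELL), math.floor(y / CELL))


def orientation_bin(dx, dy):
    """Quantise an in-plane direction; orientation is taken modulo 180 degrees."""
    angle = math.atan2(dy, dx) % math.pi
    return int(angle / math.pi * N_ORIENT) % N_ORIENT


def build_histogram(steps=8):
    """Exhaustively sample arm poses and count projected hand orientations."""
    hist = defaultdict(lambda: [0] * N_ORIENT)
    azimuths = [2 * math.pi * k / steps for k in range(steps)]
    elevations = [math.pi * (k + 0.5) / steps - math.pi / 2 for k in range(steps)]
    for a1 in azimuths:
        for e1 in elevations:
            upper = direction(a1, e1)
            for a2 in azimuths:
                for e2 in elevations:
                    fore = direction(a2, e2)
                    if math.hypot(fore[0], fore[1]) < 1e-6:
                        continue  # forearm points along the optical axis
                    # Orthographic projection: drop the z component. The hand
                    # is assumed to extend parallel to the forearm.
                    x = SHOULDER[0] + UPPER_ARM * upper[0] + FOREARM * fore[0]
                    y = SHOULDER[1] + UPPER_ARM * upper[1] + FOREARM * fore[1]
                    hist[position_bin(x, y)][orientation_bin(fore[0], fore[1])] += 1
    return hist


def orientation_likelihood(hist, hand_rel, o_bin):
    """Likelihood of a quantised orientation given the hand position bin."""
    counts = hist.get(position_bin(*hand_rel))
    if not counts:
        return 0.0  # position never reached by the sampled arm poses
    return counts[o_bin] / sum(counts)
```

Normalising the counts per position bin gives a conditional distribution over
orientation, which is the quantity the tracker looks up.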



Fig. 3. Schematic diagram of the hand orientation histogram process.



2.2 Head Tracking Using Mean Shift

The first two constraints to be exploited are that the head is generally larger
than the hands in the image, and that head movement is significantly more stable
and moderate than hand motion. We track the head directly using an iterated
mean shift algorithm. This method converges on the local mode of the skin
probability distribution. Despite its simplicity, the algorithm is very robust to
occlusion by hands. The head is modelled as a rectangular region containing skin
pixels. A search region is defined such that it is centred on the head box but is
slightly larger. Given an initial or previous position $(c_x(t), c_y(t))$, the
algorithm iteratively calculates the spatial mean of skin pixels in the rectangular
search region and shifts the box to be centred on that estimated mean until it
converges, as detailed in Figure 4.



loop:
    $c_x(t-1) \leftarrow c_x(t)$, $c_y(t-1) \leftarrow c_y(t)$
    $c_x(t) = \frac{1}{n_{skin}} \sum_{p \in S} p_x$, $c_y(t) = \frac{1}{n_{skin}} \sum_{p \in S} p_y$,
    where $p = (p_x, p_y)$ is a pixel, $S$ is the set of skin pixels in
    the search region and $n_{skin} = |S|$.
    Set search region centre to $(c_x(t), c_y(t))$.
until $c_x(t) = c_x(t-1)$ and $c_y(t) = c_y(t-1)$.

Fig. 4. The mean shift algorithm for tracking the head box.



After convergence, the size of the head box is set according to the following
heuristic:

    $w = \sqrt{n_{skin}}$    (1)

    $h = 1.2w$    (2)

Note that the search region must be slightly larger than the head rectangle to 
avoid continual shrinking of the box, and to allow significant movement of the 
head without loss of track. 
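
A minimal sketch of the head tracker follows, assuming a binary skin image
stored as a list of rows and integer pixel coordinates. The function name, the
`margin` by which the search region exceeds the head box, the iteration cap and
the reading of $w$ and $h$ as the box width and height are assumptions made
for this illustration.

```python
import math


def mean_shift_head(skin, centre, box_w, box_h, margin=2, max_iter=50):
    """Shift the head box to the local mean of skin pixels until it converges.

    skin   : binary image as a list of rows (1 = skin pixel)
    centre : initial or previous box centre (cx, cy)
    Returns the converged centre and the new box size (w, h).
    """
    rows, cols = len(skin), len(skin[0])
    cx, cy = centre
    n_skin = 0
    for _ in range(max_iter):
        # The search region is the head box enlarged by `margin` on each side.
        half_w = box_w // 2 + margin
        half_h = box_h // 2 + margin
        sum_x = sum_y = n_skin = 0
        for y in range(max(0, cy - half_h), min(rows, cy + half_h + 1)):
            for x in range(max(0, cx - half_w), min(cols, cx + half_w + 1)):
                if skin[y][x]:
                    sum_x += x
                    sum_y += y
                    n_skin += 1
        if n_skin == 0:
            break  # no skin in the search region: keep the last estimate
        new_cx, new_cy = round(sum_x / n_skin), round(sum_y / n_skin)
        if (new_cx, new_cy) == (cx, cy):
            break  # converged: the centre did not move
        cx, cy = new_cx, new_cy
    w = math.sqrt(n_skin)  # assumed reading of the width heuristic
    h = 1.2 * w            # height heuristic h = 1.2w
    return (cx, cy), (w, h)


# Toy example: a 3x3 block of skin pixels in a 10x10 image.
skin = [[0] * 10 for _ in range(10)]
for y in range(4, 7):
    for x in range(5, 8):
        skin[y][x] = 1

centre, size = mean_shift_head(skin, (4, 4), box_w=3, box_h=3)
```

With `margin=0` the search region would equal the head box, which is the
situation the text warns against: the box could only shrink and the track
would be lost as soon as the head moved.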



2.3 Local Skin Colour Clusters 

Under the assumption that the head and hands form the largest moving connec- 
ted skin coloured regions in the image, tracking the hands reduces to matching 
the previous hand estimate to the skin clusters in the current frame. This as- 
sociation can be performed either at the pixel level or at a “cluster” level. At 
the pixel level, hands are tracked using local search via updating of spatial hand 
box means and variances (size). At the cluster level, a connected components 
algorithm is used to find all spatially connected sets of coloured pixels, which 
are subsequently treated as discrete entities. We have chosen to use the cluster 
representation for three reasons: 

— The pixel-level approach requires estimation of spatial means and variances 
of pixels which are quite sensitive to outliers. Even if medians are used 
instead of means, the hand box sizes are very sensitive to noise. 

— The local tracking approach requires heuristic search parameters, and is 
generally invalid for discontinuous motion since the hands may move a sig- 
nificant distance from one frame to the next. 

— Reasoning about hand associations is easier using the higher-level cluster 
representation. 

We used a connected components algorithm that has computational complexity 
linear in the number of skin pixels to obtain a list of skin clusters in the current 
frame. The components are drawn only from those portions of the region outside 
of an exclusion region defined by the head tracker box. The exclusion region is 
slightly larger than the head box due to protruding necklines or ears that can be 
mistaken for potential hand clusters. Clusters containing only a few pixels are 
assumed to be noise and removed. Finally the clusters are sorted in descending 
order of their skin pixel count for subsequent use. 
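
The clustering step can be sketched as below. This is an illustrative
breadth-first labelling with 4-connectivity, which is an assumption; the text
does not say which connectivity or which algorithm was used. The sketch scans
every pixel once, so it is linear in the image size rather than in the number
of skin pixels. The `min_pixels` noise threshold and the inclusive exclusion
box format are also hypothetical.

```python
from collections import deque


def skin_clusters(skin, exclusion, min_pixels=3):
    """Return skin clusters outside the exclusion box, largest first.

    skin      : binary image as a list of rows (1 = skin pixel)
    exclusion : (x0, y0, x1, y1), inclusive box around the tracked head
    Each cluster is a list of (x, y) pixel coordinates.
    """
    rows, cols = len(skin), len(skin[0])
    x0, y0, x1, y1 = exclusion

    def usable(x, y):
        inside_head = x0 <= x <= x1 and y0 <= y <= y1
        return bool(skin[y][x]) and not inside_head

    seen = [[False] * cols for _ in range(rows)]
    clusters = []
    for y in range(rows):
        for x in range(cols):
            if seen[y][x] or not usable(x, y):
                continue
            # Breadth-first flood fill over 4-connected neighbours.
            queue = deque([(x, y)])
            seen[y][x] = True
            cluster = []
            while queue:
                px, py = queue.popleft()
                cluster.append((px, py))
                for nx, ny in ((px + 1, py), (px - 1, py), (px, py + 1), (px, py - 1)):
                    if (0 <= nx < cols and 0 <= ny < rows
                            and not seen[ny][nx] and usable(nx, ny)):
                        seen[ny][nx] = True
                        queue.append((nx, ny))
            if len(cluster) >= min_pixels:  # tiny clusters are treated as noise
                clusters.append(cluster)
    clusters.sort(key=len, reverse=True)  # descending skin pixel count
    return clusters


skin = [[1, 1, 0, 1, 1, 0],
        [1, 1, 0, 0, 1, 0],
        [0, 0, 0, 0, 0, 0],
        [0, 1, 1, 1, 0, 0],
        [0, 1, 1, 1, 0, 1]]
clusters = skin_clusters(skin, exclusion=(0, 0, 1, 1))
```

In the toy image the top-left block stands in for the head and is excluded,
and the isolated pixel in the bottom-right corner is discarded as noise.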



2.4 Initialisation 

Tracking is initialised by using skin colour to focus on areas of interest, then
performing a multi-scale, multi-position, identity-independent face search within
these regions using a Support Vector Machine (SVM) [20]. An example is shown
in Figure 5. The SVM has been trained only on frontal and near-frontal faces,
so it is assumed that the subject is initially facing approximately towards the
camera. The mean shift head tracker is then initialised on the detected face
region. Since the hand tracker only uses temporal association as a secondary cue,
full tracking of the body can begin immediately after this partial initialisation.

Fig. 5. Example of the tracker initialisation using an SVM.

3 Reasoning about Body-Parts Association Using Bayesian Inference



Given only the visual cues described in the previous section, the problem is now
to determine the association of skin colour clusters to the left and right hands.
One can consider this situation to be equivalent to watching a mime artist
wearing a white face mask and white gloves, in black clothing against a black
background (see Figure 2(b)). Further, only discontinuous information is
available, as though a strobe light were operating, creating a “jerky” effect (see
Figure 6(a)). Under these conditions explicit modelling of body dynamics
inevitably makes too strong an assumption about image data. Rather, the tracking
can be performed better and more robustly through a process of deduction. This
requires full exploitation of both visual cues and high-level contextual knowledge.
For instance, we know that at any given time a hand is either (1) associated with
a skin colour cluster, or (2) it occludes the face (and is therefore “invisible” using
only skin colour) as in Figures 6(b) and 6(c), or (3) it has disappeared from the
image as in Figure 6(d). When considering both hands, the possibility arises that
both hands are associated with the same skin colour cluster, for example when
one clasps the hands together, as shown in Figure 6(e).



Clearly a mechanism is required for reasoning about the situation. In the 
next section, Bayesian Belief Networks (BBNs) are introduced as a mechanism 
for performing inference, after which we describe how BBNs have been applied 
to our tracking problem. 



3.1 Bayesian Belief Networks 

The obvious method of incorporating semantics into our tracking problem would 
be through a fixed set of rules. However there are two unpleasantries associated 
with this approach: brittleness and global lack of consistency. Hard rule-bases 
are notoriously sensitive to noise because once a decision has been made based 
on some fixed threshold, subsequent decision-making is isolated from the
contending unchosen possibilities. Sensitivity to noise is undesirable in our situation
since we are dealing with very noisy and uncertain image data. The rule-based
approach can also suffer from global consistency problems because commitment 
to a single decision precludes feedback of higher-level knowledge to refine lower- 
level uncertain observations or beliefs. 

An alternative approach to reasoning is based on soft, probabilistic decisions.
Under such a framework all hypotheses are considered to some degree but
with an associated probability. Bayesian Belief Networks provide a rigorous
framework for combining semantic and sensor-level reasoning under conditions of
uncertainty. Given a set of variables $\mathbf{W}$ representing the scenario,^1 the
assumption is that all our knowledge of the current state of affairs is encoded
in the joint distribution of the variables conditioned on the existing evidence,
$P(\mathbf{w} | \mathbf{e})$. Explicit modelling of this distribution is unintuitive
and often infeasible. Instead, conditional independencies between variables can
be exploited to sparsely specify the joint distribution in terms of more tangible
conditional distributions between variables.

A BBN is a directed acyclic graph that explicitly defines the statistical (or
“causal”) dependencies between all variables.^2 These dependencies are known a
priori and used to create the network architecture. Nodes in the network
represent random variables, while directed links point from conditioning to
dependent variables. For a link between two variables, $X \rightarrow Y$, the
distribution $P(y | x)$ in the absence of evidence must be specified beforehand
from contextual knowledge. As evidence is presented to the network over time
through variable instantiation, a set of beliefs is established that reflects both
prior and observed information:

    $BEL(x) = P(x | \mathbf{e})$    (3)

where $BEL(x)$ is the belief in the value of variable $X$ given the evidence
$\mathbf{e}$. Updating of beliefs occurs through a distributed message-passing
process that is made possible via exploitation of local dependencies and global
independencies. Hence dissemination of evidence to update currently-held beliefs
can be performed in a tractable manner to arrive at a globally consistent
evaluation of the situation.

^1 Regarding notation, upper-case is used to denote a random variable, lower-case to
denote its instantiation, and boldface is used to represent sets of variables.

^2 Therefore the statistical independencies are implicitly defined as well.

Fig. 6. Examples of the difficulties associated with tracking the body: (a) motion is
discontinuous between frames; (b) one hand occludes the face; (c) both hands occlude
the face; (d) a hand is invisible in the image; and (e) the hands occlude each other.

A BBN can subsequently be used for prediction and queries regarding values
of single variables given current evidence. However, if the most probable joint
configuration of several variables given the evidence is required, then a process
of belief revision^3 (as opposed to belief updating) must be applied to obtain the
most probable explanation of the evidence at hand, $\mathbf{w}^*$, defined by the
following criterion:

    $P(\mathbf{w}^* | \mathbf{e}) = \max_{\mathbf{w}} P(\mathbf{w} | \mathbf{e})$    (4)

where $\mathbf{w}$ is any instantiation of the variables $\mathbf{W}$ consistent with
the evidence $\mathbf{e}$, termed an explanation or extension of $\mathbf{e}$, and
$\mathbf{w}^*$ is the most probable explanation/extension. This corresponds to the
locally-computed function expressing the local belief in the extension:

    $BEL^*(x) = \max_{\mathbf{w}'_X} P(x, \mathbf{w}'_X | \mathbf{e})$    (5)

where $\mathbf{W}'_X = \mathbf{W} - X$.
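
The distinction between belief updating and belief revision can be seen on a
toy joint distribution over two binary variables. The numbers below are
invented purely for illustration and have no connection to the tracker's
network.

```python
# Hypothetical joint distribution P(X, Y | e) over two binary variables.
joint = {(0, 0): 0.3, (0, 1): 0.3, (1, 0): 0.0, (1, 1): 0.4}


def marginal(joint, index):
    """Marginal distribution of the variable at position `index`."""
    dist = {}
    for values, p in joint.items():
        dist[values[index]] = dist.get(values[index], 0.0) + p
    return dist


def belief_updating(joint):
    """Pick the most probable value of each variable separately."""
    n_vars = len(next(iter(joint)))
    picks = []
    for i in range(n_vars):
        dist = marginal(joint, i)
        picks.append(max(dist, key=dist.get))
    return tuple(picks)


def belief_revision(joint):
    """Pick the single most probable joint configuration."""
    return max(joint, key=joint.get)


separately = belief_updating(joint)   # maximises each marginal
jointly = belief_revision(joint)      # maximises the joint distribution
```

Here the marginals favour X = 0 and Y = 1, a combination with probability 0.3,
whereas the most probable joint configuration is X = 1, Y = 1 with probability
0.4. This is why assigning both hands requires belief revision.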

3.2 Tracking by Inference 

The BBN for tracking hands is shown in Figure 7. Abbreviations are: LH = left
hand, RH = right hand, LS = left shoulder, RS = right shoulder, B1 = skin
cluster 1, B2 = skin cluster 2. There are 19 variables,
$\mathbf{W} = \{X_1, X_2, \ldots, X_{19}\}$. The first point to note is that some of
the variables are conceptual, namely $X_1$, $X_2$, $X_5$ and $X_9$, while the
remaining variables correspond to image-measurable quantities,
$\mathbf{e} = \{X_3, X_4, X_6, X_7, X_8, X_{10}, \ldots, X_{19}\}$. All quantities in
the network are, or have been transformed to, discrete variables. The conditional
probability distributions attributed to each variable in the network are specified
beforehand using either domain knowledge or statistical sampling. At each time
step, all of the measurement variables are instantiated from observations. B1 and
B2 refer to the two largest skin clusters in the image (apart from the head),
obtained as per Section 2.3. Absence of clusters is handled by setting the
variables $X_5$ and $X_9$ to have zero probability of being a hand. The localised
belief revision method is then employed until the network stabilises and the most
probable joint explanation of the observations is obtained:

    $P(\mathbf{w}^* | \{x_3, x_4, x_6, x_7, x_8, x_{10}, \ldots, x_{19}\}) = \max_{\mathbf{w}} P(\mathbf{w} | \{x_3, x_4, x_6, x_7, x_8, x_{10}, \ldots, x_{19}\})$    (6)

This yields the most likely joint values of $X_1$ and $X_2$, which can be used to
set the left and right hand box positions.

^3 The difference between belief updating and belief revision comes about because, in
general, the values for variables X and Y that maximise their joint distribution are
not the values that maximise their individual marginal distributions.

Note that the network structure is not singly connected, due to the loops
formed through $X_1$ and $X_2$. Consequently the simple belief revision algorithm
of Pearl cannot be used due to non-convergence. Instead, we apply the
more general inference algorithm of Lauritzen and Spiegelhalter. This
inference method transforms the network to a join tree, each node of which
contains a sub-set of variables called a clique. The transformation to the join
tree needs to be performed only once, off-line. Inference then proceeds on the
join tree via a message-passing mechanism similar to the method proposed by
Pearl. The complexity of the propagation algorithm is proportional to the span
of the join tree and the largest state space size amongst the cliques. The variables
and their dependencies are now explained as follows.




Fig. 7. A Bayesian Belief Network representing dependencies amongst variables in the 
human body-parts tracking scenario. 



$X_1$ and $X_2$: the primary hypotheses regarding the left and right hand positions
respectively. These variables are discrete with values {CLUSTER1, CLUSTER2,
HEAD}, which represent skin cluster 1, skin cluster 2 and occlusion of the
head respectively. Note that disappearance of the hands is not modelled here
for simplicity.

$X_3$; $X_{10}$: the distance in pixels of the previous left/right-hand box position
from the currently hypothesised cluster. The dependency imposes a weak
spatio-temporal constraint that hands are more likely to have moved a small
distance than a large distance from one frame to the next.

$X_4$; $X_{11}$: the distance in pixels of the hypothesised cluster from the
left/right shoulder. The shoulder position is estimated from the tracked head
box. This dependency specifies that the hypothesised cluster should lie within a
certain distance of the shoulder as defined by the length of the arm.

$X_5$, $X_{12}$, $X_{13}$, $X_{14}$, $X_{15}$; $X_9$, $X_{16}$, $X_{17}$, $X_{18}$, $X_{19}$:
these variables determine whether each cluster is a hand. $X_5$ and $X_9$ are
boolean variables specifying whether their respective clusters are hands or
noise. The variables have an obvious dependency on $X_1$ and $X_2$: if either hand
is a cluster, then that cluster must be a hand. The descendants of $X_5$ and $X_9$
provide evidence that the clusters are hands. $X_{12}$ and $X_{19}$ are the number
of skin pixels in each cluster, which have some distribution depending on
whether or not the cluster is a hand. $X_{13}$ and $X_{18}$ are the number of motion
pixels in each cluster, expected to be high if the cluster is a hand. Note that
these values can still be non-zero for non-hands due to shadows, highlights and
noise on skin-coloured background objects. $X_{14}$ and $X_{17}$ are the aspect
ratios of the clusters, which will have a certain distribution if the cluster is a
hand, but no constraints if the cluster is not a hand. $X_{15}$ and $X_{16}$ are the
spatial areas of the enclosing rectangles of the clusters. For hands, these values
have a distribution relative to the size of the head box, but for non-hands
there are no expectations.

$X_6$ and $X_7$: the number of moving pixels and number of skin-coloured pixels
in the head exclusion box respectively. If either of the hands is hypothesised
to occlude the head, we expect more skin pixels and some motion.

$X_8$: orientation of the respective hand, which depends to some extent on its
spatial position in the screen relative to the head box. This orientation is
calculated for each hypothesised hand position, and the histogram described
in Section 2.1 is used to assign a conditional probability.

Under this framework, all of the visual cues can be considered simultaneously
and consistently to arrive at a most probable explanation for the positions of
both hands. BBNs lend the benefit of being able to “explain away” evidence,
which can be of use in our network. For example, if the belief that the right
hand occludes the face increases, this decreases the belief that the left hand also
occludes the face because it explains any motion or growth in the number of skin
pixels in the head region. This comes about through the indirect coupling of the
hypotheses $X_1$ and $X_2$ and the fixed amount of probability attributable to any
single piece of evidence. Hence probabilities are consistent and evidence is not
“double counted”.
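
The “explaining away” effect can be reproduced on a three-node toy network in
which two independent causes (left hand occludes the head, right hand occludes
the head) share one effect (extra skin pixels in the head box). This is not
the 19-variable network of Figure 7; the structure is a minimal stand-in and
all probabilities are hypothetical.

```python
from itertools import product

# Hypothetical numbers, chosen only to illustrate the effect.
P_OCCLUDES = 0.2                       # prior that a given hand occludes the head
P_EXTRA_SKIN = {(0, 0): 0.05,          # P(extra skin in head box | left, right)
                (0, 1): 0.90,
                (1, 0): 0.90,
                (1, 1): 0.95}


def joint(left, right, extra):
    """Joint probability of one full assignment of the three variables."""
    p = (P_OCCLUDES if left else 1 - P_OCCLUDES)
    p *= (P_OCCLUDES if right else 1 - P_OCCLUDES)
    p_extra = P_EXTRA_SKIN[(left, right)]
    return p * (p_extra if extra else 1 - p_extra)


def belief_left(extra, right=None):
    """P(left hand occludes head | evidence), computed by enumeration."""
    numerator = denominator = 0.0
    for left, r in product((0, 1), repeat=2):
        if right is not None and r != right:
            continue  # inconsistent with the observed value of `right`
        p = joint(left, r, extra)
        denominator += p
        if left:
            numerator += p
    return numerator / denominator


before = belief_left(extra=1)           # only the extra skin pixels are observed
after = belief_left(extra=1, right=1)   # the right hand is now known to occlude
```

With these numbers the belief that the left hand occludes the head falls from
about 0.51 to about 0.21 once the right hand accounts for the extra skin
pixels, even though the two hands are independent a priori.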

4 Experimental Evaluation 

An experimental evaluation of the atemporal belief-based tracker is now
presented. First, examples of the tracker’s behaviour are given, then a comparison
is performed between the BBN tracker and two other tracking methods. Note that
to make our point about the difficulty of discontinuous motion more forcefully,
we captured all video data at a relatively high frame rate of 18 fps and used
off-line processing.



4.1 Tracker Performance Examples 



Selected frames from four different video sequences consisting of 141 to 367
frames per sequence are shown in Figure 8. Each sub-figure shows frames from one
sequence temporally ordered from left to right, top to bottom. It is important to
note that the frames are not consecutive. In each image a box frames the head
and each of the two hands. The hand boxes are labelled left and right, showing
the correct assignments. In the first example, Figure 8(a), the hands are
accurately tracked before, during and after mutual occlusion. In Figure 8(b),
typical coughing and nose-scratching movements bring about occlusion of the head
by a single hand. In this sequence the two frames marked with “A” are adjacent
frames, exhibiting the significant motion discontinuity that can be encountered.
Although the frame rate was high, this discontinuity came about due to disk
swapping during video capture. Nevertheless the tracker was able to correctly
follow the hands. In Figure 8(c) the subject undergoes significant whole body
motion to ensure that the tracker works while the head is constantly moving.
With the hands alternately occluding each other and the face in a tumbling
action, the tracker is still able to follow the body parts. In the third-to-last frame
both hands simultaneously occlude the face. The example of Figure 8(d) has the
subject partially leaving the screen twice to fetch and then offer a book. Note
that in the frames marked “M” one hand is not visible in the image. Since this
case is not explicitly modelled by the tracker, occlusion with the head or the
other hand is deduced. After these periods of disappearance, the hand is once
again accurately tracked.



4.2 Comparison with Dynamic and Non-contextual Trackers 

We compared the atemporal belief-based tracker experimentally with two other 
tracking methods: 

dynamic: assuming temporal continuity exists between frames over time and
linear dynamics, this method uses Kalman filters for each body part to match
boxes at the pixel level between frames.

non-contextual: similar to the belief-based method, this method assumes
temporal continuity but does not attempt to model the dynamics of the body
parts. The method matches skin clusters based only on spatial association
without the use of high-level knowledge.
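
Purely spatial association of the kind used by the non-contextual tracker can
be sketched as a nearest-centroid rule. The exact matching rule used in the
experiments is not given in the text, so the function below and its data
layout are assumptions; the example only illustrates why such a rule fails
when the hands move a long way between frames.

```python
import math


def nearest_cluster_assignment(prev_hands, centroids):
    """Assign each hand to the cluster centroid nearest its previous position.

    prev_hands : {"left": (x, y), "right": (x, y)} from the previous frame
    centroids  : list of (x, y) cluster centroids in the current frame
    Returns {"left": cluster_index, "right": cluster_index}.
    """
    return {hand: min(range(len(centroids)),
                      key=lambda i: math.dist(pos, centroids[i]))
            for hand, pos in prev_hands.items()}


prev_hands = {"left": (10, 50), "right": (90, 50)}

# Small motion: cluster 0 is the left hand, cluster 1 the right hand.
smooth = nearest_cluster_assignment(prev_hands, [(12, 48), (88, 52)])

# Discontinuous motion: the hands have swapped sides between frames, so the
# left hand is really cluster 0 at (85, 50), but the rule picks cluster 1.
jerky = nearest_cluster_assignment(prev_hands, [(85, 50), (15, 50)])
```

In the second case both hands are mismatched, which is one of the failure
modes the comparison is designed to count.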

It is difficult to compare the tracking methods fairly in this context.
Comparison of the average deviation from the true hand and head positions would
be misleading because of the all-or-nothing nature of matching to discrete clusters.
Another possible criterion is the number of frames until loss-of-track, but this is
somewhat unfair since a tracker may lose lock at the start of the sequence and
then regain it for the rest of the sequence. The criterion we chose for comparison
is the total number of frames on which at least one body part was incorrectly
tracked, or the hands were mismatched. The comparison was performed on 14
sequences containing two different people totalling 3300 frames.

Fig. 8. Examples of discontinuous motion tracking.

Table 1 shows the number of frames incorrectly tracked by each method, in
absolute terms and as a percentage of the total number of frames. The
belief-based tracker performs significantly better than the other two methods, even
though the data was captured at a high frame rate. Therefore the benefits of
using contextual knowledge to track discontinuous motion by inference rather
than temporal continuity are significant. One would expect even greater
improvements if low frame-rate data were used. The most common failure modes for
the belief-based and non-contextual trackers were incorrect assignment of the
left and right hands to clusters, and locking on to background noise when one
hand was occluded. The dynamic tracker often failed due to inaccurate
temporal prediction of the hand position. Two examples of this failure are shown
in consecutive frames in Figure 9. Although one could use more sophisticated
dynamic models, it is very unlikely they will ever be able to feasibly capture
the full gamut of human behaviour, let alone accurately predict under heavily
discontinuous motion. For example, one existing body-parts tracker switches in
appropriate high-level models of behaviour for improved tracking, but the
computational cost increases with the number of possible behaviours modelled. In
terms of processing speed, all trackers had approximately the same performance.
The average frame rate was about 4 fps on a PII 330 using 320x240 images.



                    incorrect frames
method              number      %
belief-based        439         13
dynamic             728         22
non-contextual      995         30

Table 1. Comparative results of the three tracking methods.
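
The comparison criterion and the percentages in Table 1 can be checked with a
short calculation. The frame counts and the total of 3300 frames are taken
from the text; the per-frame record format in `count_incorrect` is a
hypothetical layout introduced only for this sketch.

```python
def count_incorrect(frames):
    """Count frames on which at least one body part was incorrectly tracked.

    frames : iterable of dicts mapping a body part to True (correct) or False.
    A left/right mismatch is recorded as both hands being incorrect.
    """
    return sum(1 for parts in frames if not all(parts.values()))


# Hypothetical three-frame example of the criterion.
demo_frames = [
    {"head": True, "left": True, "right": True},
    {"head": True, "left": False, "right": False},   # hands mismatched
    {"head": True, "left": True, "right": True},
]
demo_count = count_incorrect(demo_frames)

# Figures reported in the text and in Table 1.
TOTAL_FRAMES = 3300
INCORRECT = {"belief-based": 439, "dynamic": 728, "non-contextual": 995}
percentages = {method: round(100 * n / TOTAL_FRAMES)
               for method, n in INCORRECT.items()}
```

Rounded to the nearest whole number, the counts reproduce the percentage
column of the table.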



Fig. 9. Two examples of the failure of the dynamic Kalman filter tracker.



5 Conclusion

Observations of body motion in real-time systems can often be jerky and
discontinuous. Contextual knowledge can be used to overcome ambiguities and
uncertainties in measurement. We have presented a method for tracking
discontinuous motion of multiple occluding body parts of an individual from a single
2D view. Rather than modelling spatio-temporal dynamics, tracking is
performed by reasoning about the observations using a Bayesian Belief Network. The
BBN framework performs bottom-up and top-down message passing to fuse both
conceptual and sensor-level quantities in a consistent manner. Hence the visual
cues of skin colour, image motion and local intensity orientation are fused with
contextual knowledge of the human body. The inference-based tracker was tested
and compared with dynamic and non-contextual approaches. The results
indicate that fusion of all available information at all levels significantly improves
the robustness and consistency of tracking.

We wish to extend this work in two ways. First, the tracker can be made 
adaptive so that no parameters need to be changed when different people are 
tracked. Second, the current tracker assumes that there is only one person in the 
field of view, but we wish to use the tracker in scenes containing several people. 
We will investigate how trackers can be instantiated as people enter the scene, 
and how the tracker networks can be causally coupled so that skin clusters can 
be explained away by one network and not considered by the other networks.



Abstract. Docking is a fundamental requirement for a mobile robot in
order to be able to interact with objects in its environment. In this paper
we present an algorithm and implementation for a special case of
the docking problem for ground-based robots. We require the robot to
dock with a fixated environment point where only visual information is
available. Specifically, camera pan/tilt information is unknown, as is the
direction of motion with respect to the object and the robot’s velocity.
Further, camera calibration is unavailable. The aim is to minimise the
difference between the camera optical axis and the robot heading
direction. This constitutes a behaviour for controlling robot direction based
on fixation. This paper presents a full mathematical derivation of the
method and implementation used. In its most general form, the method
requires partial segmentation of the optical flow field. The experiments
presented, however, assume partial knowledge as to whether points are
closer to the camera than the fixation point or further away. There are
many scenarios in robotic navigation where such assumptions are typical
working conditions. We examine two cases: convex objects; and distant
background/floor. The solution presented uses only the rotational
component of optical flow from a log-polar sensor. Results are presented with
real image and ray-traced image sequences. The robot is controlled based
on a single component of optical flow over a small portion of the image,
and thus is suited to real-time implementation.

Keywords: Active vision and real-time vision, vision-guided mobile
robots, and docking.



1 Introduction 

Docking is a fundamental requirement for a mobile robot to interact with
objects in its environment. In order to perform operations such as manipulation
(e.g. autonomous fork-lifts) or industrial assembly, a mobile robot must be
able to dock. In this paper, we present the derivation and implementation of an
active behaviour for controlling robot heading direction to support docking with
a fixated environment point. Only visual information is required (i.e. pan/tilt
information, and the robot’s heading direction and velocity, are unknown). Further,
camera calibration is not required. This method does not require knowledge of
the object, other than a constraint on the distribution of the depth of points
relative to the fixation point. Previously, we demonstrated that it was sufficient
that the object be convex and centred in the image, or that the majority of
background points be behind the fixated object [2]. This method can be used
as a behaviour that is independent of fixation and high-level planning. The
robot need only fixate on a point in the desired heading direction and invoke the
behaviour, and then the robot will turn as required. This is elegant from an
architectural viewpoint for general docking, but also facilitates systems where
fixation is entirely separate from platform control. For example, consider a
situation where fixation is controlled by a human operator with a camera attached
to a head-set and no information is available about the head-set position.

This research will be integrated with our existing mobile robot system for
circumnavigation, which uniquely identifies objects and moves around them.
Integration of docking will facilitate close inspection and manipulation.

Fixation systems control camera direction to keep the projection of a scene
point that is moving relative to the camera in a fixed position in the image.
Fixation is fundamental to active vision, and as such there are many approaches
available. Fixation can be used to facilitate perception of general
motion. Research with human subjects has shown that pedestrians can use
fixation to gain information about their instantaneous heading direction.

Docking is a difficult problem that is often handled with hardware solutions
such as tactile sensors, or by using extensive knowledge about the visual
properties of the object and camera. Santos-Victor and Sandini present an active
approach to docking for a mobile robot that corrects heading direction; however,
this assumes that the docking surface is planar.

2 The Log-Polar Sensor 

Schwartz derived an analytical formulation of biological vision systems
based on experimental measures of the mapping from the retina to the visual
cortex of monkeys. Visual data is transformed from the retinal plane in polar
coordinates $(\rho, \eta)$ to log-polar Cartesian coordinates $(\xi, \gamma)$ in the
cortical plane. The relation can be expressed:

    $\xi = \log_a \frac{\rho}{\rho_0}$,  $\gamma = q\eta$,    (1)

where $(\rho, \eta)$ are the polar coordinates of a point on the retinal plane and
$\rho_0$, $q$, and $a$ are constants determined by the physical layout of the sensor.
Thus, sensing elements appear in a non-uniform distribution, with a high density at
the central fovea, and continuously decreasing density toward the periphery.

Fig. 1. The log-polar sensor samples 64 evenly spaced angles, at 32 radii. The sensing
elements increase in size toward the periphery: (a) sensor geometry; (b) image geometry.
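
The log-polar mapping and its inverse can be sketched as follows, assuming the
standard form $\xi = \log_a(\rho/\rho_0)$, $\gamma = q\eta$. The constant
values used in the example (and the choice of $q$ so that one full turn spans
64 angular samples) are hypothetical; the real values depend on the sensor
layout. The mapping is undefined at the exact centre, where $\rho = 0$.

```python
import math


def to_log_polar(x, y, rho0, a, q):
    """Map a retinal point (x, y) to cortical coordinates (xi, gamma)."""
    rho = math.hypot(x, y)          # must be positive: log is undefined at 0
    eta = math.atan2(y, x)
    xi = math.log(rho / rho0, a)    # xi = log_a(rho / rho0)
    gamma = q * eta                 # gamma = q * eta
    return xi, gamma


def from_log_polar(xi, gamma, rho0, a, q):
    """Inverse mapping from cortical (xi, gamma) back to retinal (x, y)."""
    rho = rho0 * a ** xi
    eta = gamma / q
    return rho * math.cos(eta), rho * math.sin(eta)


# Hypothetical constants: 64 angular samples per full turn, base-2 radial growth.
Q = 64 / (2 * math.pi)
xi, gamma = to_log_polar(8.0, 0.0, 1.0, 2.0, Q)
x_back, y_back = from_log_polar(*to_log_polar(3.0, 4.0, 1.0, 2.0, Q), 1.0, 2.0, Q)
```

Because $\xi$ grows with the logarithm of $\rho$, equal steps in $\xi$ cover
ever larger rings of the retinal plane, which is the decreasing sampling
density toward the periphery described above.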

A CMOS implementation simulating this type of sensor has been realised.
Figure 1 shows a log-polar image and its Cartesian reconstruction, illustrating
the high resolution at the fovea, and low resolution in the image periphery.

Jain pointed out the advantages of using optical flow from a log-polar
complex mapping for depth recovery from a translating camera with known
motion parameters. The benefits of space-variant sensors for calculation of
time-to-impact have also been demonstrated. The method presented here
demonstrates advantages for control of motion orthogonal to the image axis. The
log-polar sensor parameterisation enables closed-loop robot heading direction
control based directly on the rotational component of log-polar optical flow.

3 Theoretical Background 

Consider a robot moving in three-space with a velocity vector
$W = (W_x, W_y, W_z)$ (see Figure 2). A camera mounted on the robot can move
about all three axes with rotational velocity $\omega = (\phi, \theta, \psi)$ about
$(x, y, z)$ respectively.

Consider a point $P$ on an object in the camera field of view, specified in
camera coordinates $(X, Y, Z)$. Sandini and Tistarelli derive the motion of $P$ as
rotational and translational components as follows. From the inverse perspective
transform, the projected point on the image plane given a focal length $F$ is:

    $p = [x, y] = \frac{F}{Z}[X, Y]$.    (2)

Differentiating (2) with respect to time, we may decompose the velocity vector
into a component due to camera translation and a component due to camera
rotation:

    $V = V_t + V_r$

    $V_t = \left[\frac{xW_z - FW_x}{Z}, \frac{yW_z - FW_y}{Z}\right]$    (3)

    $V_r = \left[\frac{xy\phi - [x^2 + F^2]\theta + Fy\psi}{F}, \frac{[y^2 + F^2]\phi - xy\theta - Fx\psi}{F}\right]$    (4)

Fig. 2. (a) The coordinate system for a robot that moves in 3-space. (b) The coordinate
system for the ground-based robot.



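Now we derive equations for general motion observed by a log-polar sensor, 
following Tistarelli and Sandini [?]. Writing $V = [u, v]$, the velocity in the 
image plane can be described in terms of radial and angular coordinates: 

$$\dot\rho = \frac{xu + yv}{\rho} = u\cos\eta + v\sin\eta, \tag{5}$$

$$\dot\eta = \frac{xv - yu}{\rho^2} = \frac{v\cos\eta - u\sin\eta}{\rho}. \tag{6}$$

Substituting the motion equations (3) and (4): 

$$\dot\rho = \left( \frac{xW_z - FW_x}{Z} + \frac{xy\phi - [x^2+F^2]\theta + Fy\psi}{F} \right)\cos\eta + \left( \frac{yW_z - FW_y}{Z} + \frac{[y^2+F^2]\phi - xy\theta - Fx\psi}{F} \right)\sin\eta, \tag{7}$$

$$\dot\eta = \frac{1}{\rho}\left[ \left( \frac{yW_z - FW_y}{Z} + \frac{[y^2+F^2]\phi - xy\theta - Fx\psi}{F} \right)\cos\eta - \left( \frac{xW_z - FW_x}{Z} + \frac{xy\phi - [x^2+F^2]\theta + Fy\psi}{F} \right)\sin\eta \right]. \tag{8}$$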

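By substituting $x = \rho\cos\eta$ and $y = \rho\sin\eta$: 

$$\dot\rho = \frac{1}{Z}\left[\rho W_z - F(W_x\cos\eta + W_y\sin\eta)\right] + \left(\frac{\rho^2}{F} + F\right)(\phi\sin\eta - \theta\cos\eta), \tag{9}$$

$$\dot\eta = \frac{F}{\rho}\left[\left(\frac{W_x}{Z} + \theta\right)\sin\eta + \left(\phi - \frac{W_y}{Z}\right)\cos\eta\right] - \psi. \tag{10}$$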
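As a sanity check on this substitution, the following minimal Python sketch evaluates the Cartesian flow of Equations (3) and (4), converts it to polar form with Equations (5) and (6), and compares the result with the closed forms of Equations (9) and (10). The function names and all numerical values are illustrative assumptions, and the rotation vector is taken as $(\phi, \theta, \psi)$ about $(x, y, z)$.

```python
import math

def cartesian_flow(x, y, Z, F, W, omega):
    """Image velocity [u, v] as the sum of Equations (3) and (4)."""
    Wx, Wy, Wz = W
    phi, theta, psi = omega
    u = (x * Wz - F * Wx) / Z + (x * y * phi - (x * x + F * F) * theta + F * y * psi) / F
    v = (y * Wz - F * Wy) / Z + ((y * y + F * F) * phi - x * y * theta - F * x * psi) / F
    return u, v

def polar_flow_from_cartesian(rho, eta, Z, F, W, omega):
    """(rho_dot, eta_dot) obtained by applying Equations (5) and (6)."""
    x, y = rho * math.cos(eta), rho * math.sin(eta)
    u, v = cartesian_flow(x, y, Z, F, W, omega)
    rho_dot = u * math.cos(eta) + v * math.sin(eta)
    eta_dot = (v * math.cos(eta) - u * math.sin(eta)) / rho
    return rho_dot, eta_dot

def polar_flow_closed_form(rho, eta, Z, F, W, omega):
    """(rho_dot, eta_dot) from the closed forms of Equations (9) and (10)."""
    Wx, Wy, Wz = W
    phi, theta, psi = omega
    c, s = math.cos(eta), math.sin(eta)
    rho_dot = (rho * Wz - F * (Wx * c + Wy * s)) / Z + (rho * rho / F + F) * (phi * s - theta * c)
    eta_dot = (F / rho) * ((Wx / Z + theta) * s + (phi - Wy / Z) * c) - psi
    return rho_dot, eta_dot
```

Agreement of the two routes for arbitrary inputs indicates that the rotation about the optical axis contributes only the constant term $-\psi$ to the angular flow and nothing to the radial flow.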
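However, the retinal sensor performs a logarithmic mapping as shown in 
Equation (1); thus we have [?]: 

$$\dot\xi = \left[\frac{1}{Z}\left(W_z - \frac{F}{\rho}(W_x\cos\gamma + W_y\sin\gamma)\right) + \left(\frac{\rho}{F} + \frac{F}{\rho}\right)(\phi\sin\gamma - \theta\cos\gamma)\right]\log_a e, \tag{11}$$

$$\dot\gamma = \frac{F}{\rho}\left[\left(\frac{W_x}{Z} + \theta\right)\sin\gamma + \left(\phi - \frac{W_y}{Z}\right)\cos\gamma\right] - \psi. \tag{12}$$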

Now let us consider the case of a ground-based robot that has a camera 
mounted on a pan-tilt platform, with no capacity for rotation about the optical 
axis (Figure 2(b)). Thus, the robot moves with motion vector $W = (W_x, W_z)$, and 
the camera has an angular velocity vector $\omega = (\phi, \theta)$ given a combination 
of robot and pan/tilt platform motion. Consider also that the robot camera fixates 
independently on a target point in the environment. The pan and tilt velocities 
and absolute direction of the head are unknown, as is the absolute direction of 
the robot. Assume, without loss of generality, that the target object lies along 
the z axis, where the origin is fixed to the robot along the optical axis. 

We would like the robot to dock with the fixation point. To achieve this goal, 
the robot must adjust its heading direction such that $W_x$ is zero. For a 
ground-based robot, the magnitude of $W_y$ is not important. The issue of 
controlling the magnitude of velocity has been addressed previously [?]. 

If the robot is tracking a point in the environment, then from [?] we have: 

$$\theta = -\frac{W_x}{D}, \qquad \phi = \frac{W_y}{D}, \tag{13}$$

where $D$ is the distance to the point of fixation. 

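Substituting Equation (13) into Equations (11) and (12) to eliminate $\theta$ and 
$\phi$, and removing the redundant $\psi$, we obtain: 

$$\dot\xi = \left[\frac{W_z}{Z} + \frac{1}{D}\left(\frac{\rho}{F} + \frac{F}{\rho}\left[1 - \frac{D}{Z}\right]\right)(W_x\cos\gamma + W_y\sin\gamma)\right]\log_a e, \tag{14}$$

$$\dot\gamma = \frac{F}{\rho D}\left(1 - \frac{D}{Z}\right)(W_y\cos\gamma - W_x\sin\gamma). \tag{15}$$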
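To illustrate how the sign of the angular flow in Equation (15) depends on whether a point lies behind or in front of the fixation point, a minimal sketch follows. The function name and the numerical values are assumptions chosen only for illustration.

```python
import math

def gamma_dot(rho, gamma, Z, D, F, Wx, Wy):
    """Angular log-polar flow under fixation, following Equation (15)."""
    return (F / (rho * D)) * (1.0 - D / Z) * (Wy * math.cos(gamma) - Wx * math.sin(gamma))

# Same image location (gamma = pi/2) and same sideways velocity Wx;
# only the depth Z of the observed point changes.
behind = gamma_dot(rho=30.0, gamma=math.pi / 2, Z=400.0, D=200.0, F=100.0, Wx=0.5, Wy=0.0)
in_front = gamma_dot(rho=30.0, gamma=math.pi / 2, Z=100.0, D=200.0, F=100.0, Wx=0.5, Wy=0.0)
at_fixation_depth = gamma_dot(rho=30.0, gamma=math.pi / 2, Z=200.0, D=200.0, F=100.0, Wx=0.5, Wy=0.0)
```

With these values the flow is negative for the point behind the fixation point, positive for the point in front of it, and zero at the fixation depth, which is why the sign of the flow alone cannot give the sign of $W_x$ without knowledge of the depth ordering.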

For Equation (15), if the object is not a long way above the ground relative 
to the robot-to-object distance, then the $W_y$ residual from camera tilt will be 
small. Further, consider the log-polar image region in which $\cos\gamma$ is close 
to zero and $\sin\gamma$ is maximal in magnitude, as shown in Figure 3 and specified 
in Equation (16): 

$$\gamma \in \left[\frac{\pi}{2} - k,\ \frac{\pi}{2} + k\right] \cup \left[\frac{3\pi}{2} - k,\ \frac{3\pi}{2} + k\right], \tag{16}$$

where $k$ is a constant specifying the width of the sensor region over which the 
mean is taken. Provided $k$ is small, the coefficient of $W_y$ is close to zero. 

Thus, we may approximate $\dot\gamma$ in this region as: 

$$\dot\gamma \approx -\frac{F}{\rho D}\left(1 - \frac{D}{Z}\right) W_x \sin\gamma. \tag{17}$$

Fig. 3. $W_x$ dominates $\dot\gamma$ over the region specified in Equation (16). 




Fig. 4. The intersection of the camera field of view and a sphere centred at the camera 
focal point with a radius equal to the fixation distance. The sign of $\dot\gamma$ depends 
on which side of this volume the world point appears. 



From Equation (17) we can see that $\dot\gamma$ is directly proportional to $W_x$ for 
this part of the image. The sign of $\dot\gamma$ is dependent on the sign of $W_x$ and 
on whether D > Z for the given point. With no assumptions about scene formation, 
whether D > Z is unknown for any image point. Ideally, we could adjust the heading 
direction in either direction and check whether $\dot\gamma$ also reduces. However, for 
the optical flow method used, the magnitude of $\dot\gamma$ can vary greatly for similar 
values of $W_x$, and so the second order derivative of motion cannot be determined 
with sufficient accuracy. 

Note that in the region at the sides of the image, where $\cos\gamma$ is maximal, 
the effect of $W_x$ on $\dot\xi$ will be maximal. However, for robot motion that is 
largely towards the object, $\dot\xi$ will be dominated by the expansion component, 
$W_z$, and so is less well suited than $\dot\gamma$ for direct use in control. 



3.1 Geometric Constraints 

As a log-polar sensor has high resolution at the fovea, and this resolution 
decreases toward the periphery, it is natural to fixate the object of interest at 
the centre of the image, i.e. the optical axis should point towards the object. 

By definition, D > Z for a given point P if P is closer to the camera focal 
point than the fixation point is. By assuming the fixation point lies along the 
optical axis, we may define the region that is closer to the camera than the 
fixation point. It is the region contained within the intersection of the cone 
defining camera visibility and a sphere centred at the focal point, with radius 
equal to the fixation distance (see Figure 4). See [2] for an analysis. 
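A minimal sketch of this region test follows. It assumes a symmetric conical field of view with half-angle `half_fov` about the optical (z) axis and points given in camera coordinates; the function name and parameters are hypothetical.

```python
import math

def in_front_of_fixation(point, fixation_distance, half_fov):
    """True if a camera-frame point is visible and nearer than the fixation point.

    The region is the intersection of the visibility cone (half-angle half_fov
    about the +z optical axis) and the sphere of radius fixation_distance
    centred at the focal point.
    """
    x, y, z = point
    r = math.sqrt(x * x + y * y + z * z)
    if r == 0.0 or z <= 0.0:
        return False
    # Clamp guards against rounding pushing the ratio slightly above 1.
    angle_from_axis = math.acos(min(1.0, z / r))
    inside_cone = angle_from_axis <= half_fov
    inside_sphere = r < fixation_distance
    return inside_cone and inside_sphere
```

A point for which this test is true corresponds to D > Z in the text, so its angular flow has the opposite sign to that of a point beyond the sphere.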

In order to deduce robot heading angle relative to the optical axis we must be 
able to segment the visible scene into parts contained within this region and parts 
beyond it. A segmentation-based solution may be possible; however, in order to 
maintain processing speed, we assume constraints on scene geometry that allow 
$\dot\gamma$ to be used directly. Many assumptions are possible for particular 
situations; however, we wish to maintain the generality of the system. Two possible 
assumptions that have general application for ground-based mobile robots are: 

1. The fixated object is convex and large in the image. Thus, most visible points 
are behind the fixation point from the robot's viewpoint. 

2. The fixation point is on the ground, such that all image points below it are 
ground points that are closer to the robot, and the majority of points above 
the object are background, and so behind the object. (This assumes the 
space between the object and the robot is not cluttered with other objects.) 

To clarify the theory above, Figure 5 shows optical flow for the motion 
discussed, and for several types of component motion. 

4 Implementation 

Many methods are available for calculating optical flow, with different strengths 
and weaknesses. See [?] for a comprehensive review of methods, and [?] for a 
review of the performance of these methods. Two major considerations drive our 
choice of method for calculating optical flow: 

— Mobile robot docking is an on-line task, so any method must be capable of 
real-time performance; and, 

— Our basic assumption on scene geometry does not facilitate the use of model- 
based methods. 

The first consideration requires a fast method that avoids excessive 
computation. It should not deal with the whole image if a restricted part of the 
image will suffice. Although overall navigation performance must be robust to avoid 
unnecessary deviations, the optical flow method need not produce exact results. 
We avoided methods involving flow reconstruction, and instead used local flow 
calculation, relying on aggregation over part of the image to handle noise. 

If knowledge is available about the geometry of the docking surface, then methods 
can be applied that are based on assumptions such as that the surface is a first 
order function of image coordinates [?]. However, we wish to investigate a more 
general case where the only assumed knowledge is the approximate distance of 
points with respect to the fixation point. 

Fig. 5. Optical flow in polar coordinates. (a) The object moving closer to the camera. 
(b) Clockwise rotation in the plane. (c) Robot translation perpendicular to the image 
axis, while fixating on a point in front of the object. (d) Flow in Cartesian coordinates 
resulting from translation perpendicular to the image axis combined with motion 
toward the object, while fixating in front of the object. (e) In log-polar coordinates. 

The method chosen was that of Uras et al. [?], which does not require 
assumptions about surface shape, is simple enough for fast implementation, and 
is competitive in terms of accuracy [?]. Other fast local methods may also be 
appropriate. The method uses a local solution to the second order derivative: 

$$\begin{bmatrix} I_{xx} & I_{yx} \\ I_{xy} & I_{yy} \end{bmatrix} \begin{bmatrix} u \\ v \end{bmatrix} + \begin{bmatrix} I_{tx} \\ I_{ty} \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \end{bmatrix}. \tag{18}$$
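A minimal sketch of the per-pixel solve of Equation (18) is given below. It assumes that the second-order derivatives have already been estimated at the pixel and that the mixed derivatives are equal; the function name, the determinant threshold, and the choice to return `None` for ill-conditioned pixels are illustrative assumptions rather than details of the cited method.

```python
def solve_second_order_flow(Ixx, Ixy, Iyy, Itx, Ity, min_det=1e-6):
    """Solve [[Ixx, Ixy], [Ixy, Iyy]] [u, v]^T = -[Itx, Ity]^T by Cramer's rule."""
    det = Ixx * Iyy - Ixy * Ixy
    if abs(det) < min_det:
        return None  # Hessian too close to singular for a reliable estimate
    u = (-Itx * Iyy + Ity * Ixy) / det
    v = (-Ity * Ixx + Itx * Ixy) / det
    return u, v
```

Because each pixel needs only a 2x2 solve, the cost is constant per pixel, which suits the real-time requirement; pixels rejected by the determinant test are simply left out of any later aggregation.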



In a log-polar image, values for optical flow can be calculated at each pixel. 
However, only angles where $W_x$ dominates $\dot\gamma$ are used, as specified in 
Equation (16). Further, the log-polar sensor is a multi-scale device. The distance 
from the centre of the image at which the flow will have the most favourable 
signal-to-noise ratio is dependent on the flow scale. The flow was taken between 
two values $\rho_u$ and $\rho_l$, which were set manually depending on the geometry 
of the particular situation. Automated setting of scale is not addressed in this 
paper. We largely ignored noise in the flow calculation to facilitate speed, aiming 
instead for tolerance of noise in the input flow. We used the sign of the mean of 
$\dot\gamma$ (see Equation (19)) to control heading direction in a closed loop. 
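The selection and aggregation step described above might be sketched as follows. The flow field is represented as a hypothetical list of `(rho, gamma, gamma_dot)` samples and only one angular band is averaged. How the two bands are combined, and which sign maps to which turn direction, depend on the scene assumption and the sensor conventions, so the mapping in `heading_command` is an assumption.

```python
import math

def mean_gamma_dot(samples, rho_l, rho_u, k, centre=math.pi / 2):
    """Mean angular flow over rho_l <= rho <= rho_u and |gamma - centre| <= k."""
    selected = [gd for rho, gamma, gd in samples
                if rho_l <= rho <= rho_u and abs(gamma - centre) <= k]
    if not selected:
        return 0.0
    return sum(selected) / len(selected)

def heading_command(mean_value, turn_rate):
    """Bang-bang turn command from the sign of the mean (sign convention assumed)."""
    if mean_value > 0.0:
        return turn_rate
    if mean_value < 0.0:
        return -turn_rate
    return 0.0
```

Using only the sign makes the controller insensitive to the large variation in flow magnitude, at the cost of a fixed turn rate that can overshoot near the zero heading.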

Note that there must be sufficient net optical flow for the signal-to-noise ratio 
to be adequate for robust control. For this to be the case, the depth between the 
fixation point and the other image points must not be small with respect to the 
robot-to-object distance. For the second assumption, of a small object on the 
floor in a room, net flow will generally be sufficient given a distant background. 
However, as the robot moves close, the object may become large in the image so 
that little background is visible. Hardware final docking solutions may be 
appropriate at this stage (e.g. [?]). Alternatively, for the convex object assumption, 
the object must not be small with respect to the robot-to-object distance. 

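The mean of $\dot\gamma$ is taken over $\rho_l \le \rho \le \rho_u$ within the angular bands 
$\gamma \in [\frac{\pi}{2} - k, \frac{\pi}{2} + k]$ and $\gamma \in [\frac{3\pi}{2} - k, \frac{3\pi}{2} + k]$. (19) 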



5 Results 

We present experiments with real and simulated image sequences. All images 
are 256x256 Cartesian images that are subsampled into log-polar coordinates 
according to Equation (1). The simulated sequences show closed-loop system 
performance. We have not yet completed the implementation on our mobile 
platform; however, we have taken real image sequences that confirm that real 
data behaves in the same manner as the simulated data. Thus, the system 
should perform correctly when the loop is closed. 



5.1 Ray Traced Robot Simulations 

Image sequences of textured objects were generated using the POV-Ray ray 
tracing package. These sequences generate noisy optical flow patterns, as flow 
calculation incurs many of the difficulties that apply to real images. Simulation 
allows us to examine precise cases, enabling full evaluation of the mathematical 
theory. For example, it is difficult to ensure that the fixation point remains 
constant on the surface of a plane using real data. 

For these experiments we used a robot simulator. The simulator takes an 
image every $p$ msec, moving forward only, with fixed velocity. The heading 
direction can be changed by applying an angular velocity. This is represented as 
an average value for the time interval. Thus, for an applied angular velocity of 
$a$ radians per msec, the robot will turn $ap$ radians during the interval. 
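The simulator update described above can be sketched as below. The sketch assumes planar motion, a heading measured from the z axis, and that the turn is applied before the translation within each interval; the names `step` and `speed` are hypothetical, since the simulator's interface is not given in the text.

```python
import math

def step(x, z, heading, speed, a, p):
    """Advance the simulated robot by one image interval of p msec.

    a is the applied angular velocity in radians per msec, treated as an
    average over the interval, so the heading changes by a * p. speed is the
    fixed forward velocity in distance units per msec.
    """
    heading = heading + a * p
    x = x + speed * p * math.sin(heading)
    z = z + speed * p * math.cos(heading)
    return x, z, heading
```

For example, with `a = 0.002` and `p = 100` the robot turns 0.2 radians in one interval before moving forward.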

Early simulated results have been presented previously [2]. The results 
presented here show more extensive trials of relevant situations. 



Convex Object Assumption. The algorithm was tested on two convex objects, 
using the assumption that all points are behind the fixation point. Figure 6(a) 
shows the first object, and (b) and (c) show the heading direction with respect 
to the optical axis against total distance travelled. The robot overshoots the zero 
heading direction in both cases. This is due partly to the fact that it is getting 
close to the object. It should be noted that the depth variation within the object 
is small in comparison with the distance from the object; the resulting flow is 
therefore also small, resulting in the noise seen in the experiment. Similar results 
were found for a convex polyhedral object with a brick-like surface. 

Fig. 6. Docking with a stone-surfaced ellipsoid. (a) The object. (b) Docking when the 
initial direction of motion is at [?] radians to the optical axis. (c) Initial direction of 
motion is at [?] radians to the optical axis in the opposite direction. 



Background assumption. The second case is where the robot fixates on a 
small object in the foreground, and we assume that all points above the object 
are background (see Figure 7(a)). The lower half of the image is floor, and a 
planar wall covers most of the upper half. This is a plausible setup for a 
ground-based robot attempting to dock with an object in a room with a background 
wall and a flat floor. In this case, the system assessed only points in the upper 
part of the image. The floor could also have been used; however, this configuration 
was chosen to be consistent with that used for the real image sequences. Figure 
7(b) and (c) show plots of heading direction angle. In this case, convergence 
was faster and more stable, because the image points for which $\dot\gamma$ was 
calculated are further behind the object. Similar performance was also demonstrated 
with ray-traced images where all the points were in front of the fixation point. 

Although theoretically the algorithm should be able to correct heading 
direction with a planar surface, the trials performed were unable to support this. 
Mathematically there is a net rotational flow; however, in practice it appears 
that this component is too small to be reliably extracted from the noise in this case. 

Fig. 7. Docking with an object with a distant background. (a) At the starting point, 
the object is small in the image, with a floor plane for all points below the object and 
a background wall for all points behind the object. (b) Docking when motion begins 
at [?] radians to the optical axis. (c) Docking when motion begins at [?] radians. 

5.2 Real Images 

The object shown in Figure 8 moved in a straight line along a rail. While 
it moved it was fixated by the LIRA head [8], using a colour-based binocular 
fixation method under development at the LIRA-lab. The images shown were 
taken from one of the cameras. Two types of image sequence were taken: 

— The object moves from the right to the left of the image, toward the robot. 

— The object moves from the left to the right of the image, toward the robot. 

Figure 8 shows a sequence where the object moves at an angle such that $W_x$ 
in the coordinate system of Figure 2 would be positive. With the object 
moving, motion is similar to when the robot moves, but the background is far 
away. This exemplifies the assumption that everything above the object is background. 
Fig. 8. A clown mask is fixated as it moves along a rail from right to left. 

Figure 9 shows log-polar versions of the Cartesian images of Figure 8(c) and 
(d), and the resulting log-polar optic flow. The scale of motion for this sequence 
is large, so the symmetric regions where $\dot\gamma$ was taken were towards the 
periphery of the image. Figure 10 shows the mean of $\dot\gamma$ for this sequence. 
Figure 11 shows a similar sequence, where the rail is at an angle with respect to 
the camera such that $W_x$ would be negative. Figure 12 shows the mean of $\dot\gamma$. 

These sequences show that the sign of the mean changes with the direction of 
$W_x$. Due to slow recording of images, and minimum movement requirements 
between frames for the fixation algorithm, the changes between the images are 
large. More stable results can be expected with a higher sampling rate. 

6 Conclusion 

In this paper, we derived an algorithm which could be used to control heading 
direction for docking with an independently fixated object based on a class of 
assumptions about general scene properties. We demonstrated the algorithm’s 
effectiveness on simulated and real images. This algorithm could be used as the 
basis of a docking behaviour, whereby a mobile robot need only fixate on a point 
and invoke the behaviour and it will move toward the point. 

As this paper is early research in an interesting area, there is more work 
to be done. We are currently implementing this algorithm on a mobile robot 
to close the loop with real images. Further, as log-polar images are multi-scale, 
on-line automated determination of the optimal scale for heading control would 
be useful. Finally, this method is based on assumptions about the scene, even 
if these are high-level assumptions. Image segmentation into regions that are 
closer than the fixation point or otherwise would allow docking without scene 
knowledge. 
Fig. 9. Log-polar images of (c) and (d) in Figure 8 and the resulting optical flow. 

Fig. 10. Mean value of $\dot\gamma$ for the image sequence of Figure 8. 
Abstract. The initial development and assessment of an active computer vision 
system is described, which is designed to meet the growing demand for 3 
dimensional models of real-world objects. Details are provided of the hardware 
platform employed, which uses a modified gantry robot to manoeuvre the 
system camera and a purpose-built computer controlled turntable on which the 
object to be modelled is placed. The system software and its computer control 
system are also described along with the occluding contour technique 
developed to automatically produce initial models of objects. Examples of 
models constructed by the system are presented and experimental results are 
discussed, including results which indicate that the occluding contour technique 
can be used in an original manner to identify regions of the object surface 
which require further modelling and also to determine subsequent viewpoints 
for the camera. 



1 Introduction 



An active computer vision system can be described as a vision system in which the 
camera, or cameras, are moved in a controlled and purposive manner in order to 
capture images from different viewpoints. In many cases the purpose is to fixate on a 
feature of a moving target in order to track the target in real time [1]. In other cases, 
features in the environment are located in order to provide information for the real- 
time navigation of a robot or robot vehicle [2]. Research relating to both these 
applications has tended to concentrate on the development of high-speed, multiple 
degree of freedom platforms on which the camera, or cameras, are mounted [3,4]. 

Another potential application of active vision, which has received less attention, is 
the construction of 3 dimensional models of objects by using a moving camera to 
“explore” the surface of the object. Here the requirement is not for a high-speed 
camera platform, but for a platform which can position the camera accurately and for 
an accurate method of converting images into a 3 dimensional model of the object in a 
real-world co-ordinate system. Such fundamental differences suggest that a separate 
line of research is required to support the development of practical active vision 
systems for 3 dimensional modelling. This would undoubtedly be worthwhile as the 
demand for 3 dimensional models is growing and will continue to grow with the more 
widespread use of 3 dimensional CAD and computer graphics software. 

One key role for a system which can produce 3 dimensional models would be the 
reverse engineering of manufactured parts. Often the latter do not have an associated 
CAD model, yet there are many situations where one would be useful, as a basis for 
designing and manufacturing a similar or modified part. It is also common for the 
design of parts to be altered during the period between initial design and production 
and it would be desirable if the final version of the part could be used to update the 
original model. In addition a potential role for reverse engineering exists in the 
manufacture of moulds and dies, where traditionally crafted masters have been used 
in the production process. If a CAD model were created from the crafted master then 
modern CNC machines could be used more frequently to manufacture the tool [5]. In 
the case of jewellery and other decorative objects, the availability of CAD models 
would allow new designs to be created without the use of crafted masters [6]. Apart 
from reverse engineering it is evident that there is a growing demand for realistic 
models of both manufactured and natural objects for multimedia, film and virtual 
reality environments. Clearly computer graphics applications seldom need models as 
accurate and detailed as models required for CAD purposes and the realistic 
reproduction of surface properties, such as colour and texture, is equally important. 

To date the requirement for 3 dimensional models has been met by commercially 
available laser scanners and co-ordinate measuring machines (CMMs). However, the 
use of a CMM to digitise sufficient points to create a 3 dimensional model is a 
laborious task, which is difficult to automate. Conventional CMMs are also unsuitable 
for soft or flexible objects and return no information on the properties of a surface 
other than its geometry. The use of a laser scanner can also be problematic, since it 
may not be possible to access some regions of the object’s surface, the performance of 
the scanner is dependent on surface properties and details of surface characteristics 
such as colour and texture are difficult to acquire. In theory computer vision could 
provide a superior approach. The co-ordinates of surface points could be obtained 
quickly, efficiently and in an automated manner. Soft and flexible objects could be 
modelled and surface characteristics including colour and texture could be reproduced 
if necessary. 

For the most part, work on the use of computer vision for constructing 3 
dimensional models has relied on images obtained from multiple static cameras [7] or 
a static camera with the object rotated on a turntable [8,9]. Some research has been 
reported on the use of continuously moving cameras to obtain 3 dimensional data by 
tracking occluding contours [10] or to determine object structure from the known 
motion of the camera [11]. Various methods have also been proposed to define the 
optimum viewpoints (camera positions and orientations) for model construction 
[9,12], but there is a lack of experimental work to assess the methods proposed. 

The current paper reports the initial results of a research programme to develop a 
practical active vision system for automated 3 dimensional modelling. The prototype 
system was to be built to model objects which would fit within a 100 mm cube and 
the system was to be assembled from relatively low cost components, in the first 
instance. The intention was that the system would be able to manoeuvre a camera to 
view any part of an object’s visible surface and, critically, higher resolution data 
would be obtained, when required, by moving the camera towards the object. The 
target application would be reverse engineering, which would place a stringent 
demand on the accuracy of the results. However, it was recognised that, even if 
this accuracy were not achieved, the system might still be suitable for less 
demanding applications in computer graphics and animation. 



2 The Active Vision System 

Since it would be necessary to move a camera around an object and accurate 
positioning of the camera was required, this suggested the use of a gantry robot. A 
relatively low cost 4-axis gantry robot was sourced which formed the basis of the 
camera platform. In order to provide the system with additional flexibility a high- 
resolution turntable was designed and built. The robot’s computer control system was 
extended to incorporate control of the turntable. The initial set-up is shown 
schematically in Figure 1. 




Fig. 1. Schematic Diagram of Gantry Robot and Turntable 

As shown in the diagram, an object to be modelled is placed on the turntable and 
viewed by a colour CCD camera mounted on the robot arm. In order to enable the 
camera to view the top of the object a fifth (pitch) axis drive was designed, built and 
integrated into the robot’s control system. The camera and pitch axis drive are shown 
in Figure 2. 
Fig. 2. Camera and Pitch Axis Drive 

It was decided that two computers (both standard PCs) would be used to control 
the overall system, as shown in Figure 3. A colour digitiser was installed in computer 
1 and, as shown, this computer is responsible for image analysis. Computer 2 is 
assigned to viewpoint control which is implemented through a combination of camera 
and turntable movements. 

As indicated in Figure 3 the system was set up so that a sequence of predefined 
viewpoints can be specified. Computer 2 then moves the turntable and/or the camera 
to create each viewpoint in turn. Once both are in position, computer 1 is instructed to 
capture an image. It then sends a request to computer 2 to move the system to the next 
predefined viewpoint. Alternatively the next viewpoint can be computed as part of the 
image analysis process, so that the system will effectively operate in a closed loop. 
Ultimately computer 1 assembles and outputs the surface point data which is used to 
create a 3 dimensional model of the object. 

Communication was established between the two computers and software written 
to effect the two-way transmission of instructions and data. All time-critical software, 
such as the image analysis procedures, was written in Visual C++. Software which 
was not time-critical, including the viewpoint control program, was written in Visual 
Basic for speed of development. Software was also written to provide a Windows 
standard interface on both machines for user input and output. 



3 Constructing Initial 3 Dimensional Models 



An approach was adopted whereby the system first provides surface point data for an 
initial model using a set of predefined viewpoints. The intention was that this model 
would be used to identify regions of the surface which required further modelling and 
also, ideally, to specify the viewpoints from which this data should be obtained. 
Fig. 3. Computer Control and Image Analysis System 



An assessment of the various techniques used to extract 3 dimensional data from 
captured images was carried out, including stereo, structured light approaches and 
shape from X. It was concluded, as others have maintained, that no technique is 
generally applicable, but that a combination of techniques can be used to compensate 
for the disadvantages of individual techniques. An occluding contour approach was 
chosen as the preferred technique for generating the initial model. The obvious 
disadvantage is that regions of the object’s surface which are doubly concave cannot 
be modelled. In fact the system models both planar regions and doubly concave 
regions as planar regions. Hence it is necessary to label all regions returned as planar 
as requiring further investigation. Although not implemented to date, it is envisaged 
that the modelling of such regions will use a colour encoded structured light approach 
which uses stereo instead of triangulation [13]. In effect this means that a coloured 
structured light pattern is used to provide easily matched features in the stereo images. 
(In practice the single camera in the active vision system will be used to capture the 
two “stereo” images from laterally displaced viewpoints.) The proposed structured 
light technique has been tested experimentally, as an adjunct to the current research 
project, and the results obtained have been encouraging. 

The occluding contour approach developed for the system operates as follows. An 
image of the object is captured from each of a series of predefined viewpoints. Each image is 
obtained with the object back-lit and the image is thresholded to produce a silhouette. 
The occluding contour is extracted by locating points on the edge of the silhouette 
(the visible rim). 
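The thresholding and rim-extraction steps can be sketched as follows for a back-lit greyscale image stored as a list of rows. The threshold value, the 4-neighbour definition of the rim, and the function names are assumptions for illustration; the text does not specify them.

```python
def silhouette(image, threshold):
    """Back-lit image: object pixels are darker than the threshold."""
    return [[pixel < threshold for pixel in row] for row in image]

def rim_points(mask):
    """Silhouette pixels with at least one 4-neighbour outside the silhouette."""
    rows, cols = len(mask), len(mask[0])
    rim = []
    for r in range(rows):
        for c in range(cols):
            if not mask[r][c]:
                continue
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                rr, cc = r + dr, c + dc
                outside_image = not (0 <= rr < rows and 0 <= cc < cols)
                if outside_image or not mask[rr][cc]:
                    rim.append((r, c))
                    break
    return rim
```

With a uniformly bright back-light a single global threshold is usually adequate, which is one practical reason for back-lighting the object.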

The workspace within which the object has been placed is assumed to be divided 
into a “stack” of horizontal planes, as shown in Figure 4(a). When an occluding 
contour is extracted it is projected onto each “workspace plane” in turn. In Figure 4(b) 
the object is a sphere and the occluding contour is a circle. When projected onto a 
horizontal workspace plane, the projection will be an ellipse. (The process can also be 
regarded as one whereby the ellipse is formed from the intersection of the cone 
emanating from the focal point of the camera, which must enclose the object, and the 
workspace plane, which must also enclose the object). As further contours are 
projected onto each plane, they will define an “enclosing area” within which the 
object must exist. This is shown in Figure 4(c), where the enclosing area will tend 
towards the circular cross-section of the sphere as further contours are projected on to 
the workspace plane. 
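Projecting a rim point onto a workspace plane amounts to intersecting the ray from the camera focal point through that point with a horizontal plane. A minimal sketch follows, assuming world coordinates with z vertical and the rim point already expressed as a 3D point on the viewing ray; the function and parameter names are hypothetical.

```python
def project_to_plane(focal_point, ray_point, plane_height):
    """Intersect the ray focal_point -> ray_point with the plane z = plane_height.

    Returns the (x, y) intersection, or None if the ray is parallel to the
    plane or the intersection lies behind the camera.
    """
    fx, fy, fz = focal_point
    px, py, pz = ray_point
    dz = pz - fz
    if dz == 0.0:
        return None
    t = (plane_height - fz) / dz
    if t <= 0.0:
        return None
    return (fx + t * (px - fx), fy + t * (py - fy))
```

Applying this to every rim point of one contour, for each plane in the stack, yields the polygonal outline that bounds the object on that plane for that viewpoint.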




When a contour is projected onto a workspace plane, it is represented by a series of 
line segments between projected points on the visible rim. As shown in Figure 5(a), 
the projection of a further contour means that the intersection points between the 
contours have to be located. The intersection points are then used along with the 
appropriate projected edge points to define the enclosing area. 
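The basic primitive needed for this step, the intersection of two contour line segments, might look as follows. This is a generic geometric sketch and not the authors' technique; the function name and tolerance are assumptions, and parallel or collinear segments are deliberately not handled.

```python
def segment_intersection(p1, p2, q1, q2, eps=1e-12):
    """Intersection point of segments p1-p2 and q1-q2, or None if they do not cross."""
    rx, ry = p2[0] - p1[0], p2[1] - p1[1]
    sx, sy = q2[0] - q1[0], q2[1] - q1[1]
    denom = rx * sy - ry * sx
    if abs(denom) < eps:
        return None  # parallel or collinear segments are not handled here
    qpx, qpy = q1[0] - p1[0], q1[1] - p1[1]
    t = (qpx * sy - qpy * sx) / denom  # position along p1-p2
    u = (qpx * ry - qpy * rx) / denom  # position along q1-q2
    if 0.0 <= t <= 1.0 and 0.0 <= u <= 1.0:
        return (p1[0] + t * rx, p1[1] + t * ry)
    return None
```

Because the intersection is computed in continuous coordinates, the resulting boundary points are not tied to any grid spacing in the horizontal plane.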
Fig. 5. Defining the Enclosing Area(s) 

A heuristic technique was devised to locate the intersection points between 
projected contours. Various eventualities had to be covered, such as the situation 
shown in Figure 5(b), and to resolve some ambiguities the handedness of each 
contour had to be recorded along with the sequence of edge points. The technique 
used was found to be robust and the choice of intersection points to define enclosing 
areas has a major advantage over the commonly used voxel approach [14] since a 
priori quantization in the horizontal plane is not required. Hence the resolution of the 
horizontal co-ordinates of the model will not be limited in advance. 

When the contours from all predefined viewpoints are projected on to the full stack 
of workspace planes, a 3 dimensional model is formed consisting of the edge and 
intersection points which define the enclosing areas. Although this approach requires 
quantization in the vertical direction, additional workspace planes and hence 
additional enclosing areas can easily be defined without the need to capture further 
images. 



4 Calibration 

A thorough calibration programme was carried out to enable the system to produce 
object surface points in world co-ordinates. For this purpose a world co-ordinate 
system was specified with its origin at the centre of the turntable. Transformations 
were then defined and evaluated between the world co-ordinate system and the 
turntable, camera and robot co-ordinate systems. The camera calibration task required 
to produce the camera/robot transformation was the most demanding step in this 
process. Camera calibration was carried out using a modified implementation of the 
method developed by Tsai [15,16]. This involved measuring various intrinsic and 
extrinsic camera parameters including the position of the camera’s principal point, 
which was located with the aid of a low intensity laser. A calibration object was also 
required and this consisted of a glass plate which was sprayed matt white and marked 
with a 14 by 20 matrix of solid black squares. The precise locations of the corners of 
the squares were found using a travelling microscope. The Tsai coplanar method 
requires images of the calibration object to be captured from a wide range of 
viewpoints. Software was written to position the robot at a sequence of viewpoints 
and automatically capture and analyse images of the calibration plate. A red spot was 
marked on a square close to the centre of the plate, so that each square could be 
identified when the camera’s field of view covered only part of the plate. After each 
image was captured, the Sobel operator was used to locate the edges of the black 
squares, as shown in Figure 6. Linear regression was then employed to fit lines to all 
edges and the corners were located by finding the intersection points between the 
lines. A matrix manipulation package was used to process the matrices containing the 
image and real-world co-ordinates of the corner points, along with the measured 
camera parameters. This yielded coefficient values for the camera/robot 
transformation matrix. 
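The corner-location step can be illustrated with a short sketch. It assumes the edge pixels belonging to each side of a square have already been grouped, which the paper does not describe. It also swaps in an orthogonal (total least squares) line fit in place of ordinary linear regression, because the latter is ill-conditioned for near-vertical edges; the function names `fit_line` and `intersect` are hypothetical.

```python
import math


def fit_line(points):
    """Fit a line to 2D edge points; returns (nx, ny, c) with nx*x + ny*y = c."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points)
    syy = sum((y - my) ** 2 for _, y in points)
    sxy = sum((x - mx) * (y - my) for x, y in points)
    # Direction of greatest spread of the points (principal axis).
    angle = 0.5 * math.atan2(2.0 * sxy, sxx - syy)
    # Unit normal to that direction; the line passes through the centroid.
    nx, ny = -math.sin(angle), math.cos(angle)
    return nx, ny, nx * mx + ny * my


def intersect(line1, line2):
    """Intersect two lines given as (nx, ny, c) using Cramer's rule."""
    a1, b1, c1 = line1
    a2, b2, c2 = line2
    det = a1 * b2 - a2 * b1
    if abs(det) < 1e-12:
        raise ValueError("lines are parallel")
    return ((c1 * b2 - c2 * b1) / det, (a1 * c2 - a2 * c1) / det)
```

Each corner of a square is then the intersection of the lines fitted to two adjacent edges, which gives sub-pixel corner estimates even though the edge points themselves are quantized.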





Fig. 6. Calibration Pattern Edge Detection 



5 Initial Results 

The system was used to produce initial 3 dimensional models of a wide range of 
objects in the form of collections of surface point co-ordinates or “data clouds”. Each 
data cloud was then passed to a software package which created a surface model, 
either by fitting triangular patches or by fitting NURBS surfaces. An example is 
shown in Figure 7 of a hemisphere which was modelled by the system and rendered 
using triangular meshes (Figure 7(a)) and NURBS surfaces (Figure 7(b)). 

An examination of the proximity of the raw data points to the NURBS surface 
model produced the graph shown in Figure 8, which shows that 90% of the data 
points are within 50 microns of the surface. 

A series of tests was undertaken to establish the success or otherwise of the 
calibration exercise. Figure 9 shows two views of a NURBS surface model of a 1 inch 
BS bolt produced by the system. The standard pitch of the thread is 2.54 mm and the 
pitch as measured from the model differs by only 15.3 microns. 

Among the more complex objects modelled by the system were a set of chess 
pieces which had been produced by a Flexible Manufacturing System installed in the 
University. Figure 10 shows 3 dimensional models of the king and the rook created by 
the active vision system and rendered, in this case, using triangular meshes. 



190 P.J. Armstrong and J. Antonis 




(a) (b) 

Fig. 7. Rendered Models of Hemisphere 




Deviation in Microns 



Fig. 8. Deviation of Data Points from Surface of NURBS Model. 

As an indication that the models produced are at least suitable for computer 
graphics applications, Figure 11 shows a scene where a surface texture has been 
added to the chess piece models. They have then been placed on a manually drawn 3 
dimensional model of a chessboard. 



6 Viewpoint Planning 

As noted previously, further modelling may be necessary for one of two reasons. 
Either a planar surface has been detected which may be concave or a region of fine 
detail requires closer examination at a higher level of resolution. In the first instance, 
candidates are regions where the local surface curvature is low and in the second 
instance candidates are regions where the local surface curvature is high or where, in 
the extreme case, discontinuities occur. When contour intersection points on the 
perimeter of enclosing areas (see Figure 5) are examined it is evident that the local 
density of points is low when surface curvature is low and high when surface 
curvature is high. Hence it is postulated that the local density of intersection points 
can be used to identify potentially concave regions (low density) and regions 
requiring higher resolution modelling (high density). 




Fig. 9. Model of 1 inch Bolt 




Fig. 10. Models of Chess Pieces 
Fig. 11. Model of Chessboard Scene 



In order to measure the local density of the intersection points a mask was designed 
to compute a weighted sum of the intersection points within a small region centred on 
each point in turn. The results are then filtered to identify regions where the mask 
output is either high or low. The overall process is referred to as “point density 
filtering” and its application to a range of objects suggests that it can provide the basis 
of a method for identifying the regions which require further modelling. Figure 12 
shows the results of applying point density filtering to a model of a half-cylinder 
(shown in Figure 12(a)). All points are displayed in Figure 12(b). However, when the 
filter is biased towards high density points, Figure 12(c) is produced which highlights 
the curved rear surface of the half-cylinder and the surface discontinuities at the 
vertical edges. When the filter is biased towards low density points, the planar top and 
front surfaces are highlighted. 
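A minimal sketch of point density filtering is given below. The paper does not specify the mask weights, the size of the region or the thresholds, so the Gaussian weighting, the `radius` and `sigma` parameters and the `low` and `high` thresholds are all assumptions made for illustration, as are the function names.

```python
import math


def point_density(points, radius, sigma):
    """Weighted count of the points lying within `radius` of each point in turn."""
    densities = []
    for px, py, pz in points:
        total = 0.0
        for qx, qy, qz in points:
            d2 = (px - qx) ** 2 + (py - qy) ** 2 + (pz - qz) ** 2
            if d2 <= radius ** 2:
                # Gaussian weight: nearby points count more than distant ones.
                total += math.exp(-d2 / (2.0 * sigma ** 2))
        densities.append(total)
    return densities


def density_filter(points, densities, low, high):
    """Split off the points whose mask output is at most `low` or at least `high`."""
    low_points = [p for p, d in zip(points, densities) if d <= low]
    high_points = [p for p, d in zip(points, densities) if d >= high]
    return low_points, high_points
```

In this sketch the low-density list would hold the candidates for potentially concave (planar) regions and the high-density list the candidates for higher resolution modelling. The quadratic loop is adequate for small data clouds; a spatial index would be needed for large ones.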

In practice, further modelling would be performed on those regions with the 
highest and the lowest point densities, in the first instance. In the case of the half- 
cylinder, this would mean generating high resolution data for the two vertical edges, 
which in turn would enhance the accuracy of the model. It would also mean that the 
proposed structured light facility would be applied to the front and top surfaces, since 
they both return low point densities. 
Fig. 12. Application of Point Density Filtering 



A further procedure was devised to estimate the optimum viewpoints for 
subsequent modelling. In cases where regions requiring higher resolution data have 
been identified, the optimum viewpoint for each region is the one which places that 
region on the visible rim. This can be obtained by rotating the data point model in 
computer memory and recording the model’s orientation when the maximum number 
of intersection points from the chosen region appears on the visible rim. If orthogonal 
projection is assumed, the optimum viewing direction will in fact coincide with the 
direction of a tangent which touches the surface of the object at the centre of the 
region concerned. 
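The rotation search can be sketched for a single horizontal slice of the model. Under the orthogonal projection assumed above, the points on the visible rim of a slice are those with extreme lateral co-ordinate when seen along the viewing direction. The function name `rim_counts`, the angular step and the tolerance are hypothetical choices, and restricting the search to one slice is a simplification.

```python
import math


def rim_counts(slice_points, region, step_deg=5, tol=1e-6):
    """Count, for each viewing angle, the region points lying on the visible rim.

    slice_points: (x, y) intersection points in one horizontal workspace plane.
    region: indices into slice_points of the points in the chosen region.
    """
    counts = {}
    for deg in range(0, 360, step_deg):
        phi = math.radians(deg)
        # Axis perpendicular to the viewing direction (cos phi, sin phi).
        ux, uy = -math.sin(phi), math.cos(phi)
        lateral = [x * ux + y * uy for x, y in slice_points]
        lo, hi = min(lateral), max(lateral)
        counts[deg] = sum(
            1 for i in region
            if lateral[i] - lo <= tol or hi - lateral[i] <= tol
        )
    return counts
```

The angles with the largest count are the candidate viewpoints. For a circular slice and a region consisting of a single point, the count is non-zero only at the two angles where the viewing direction is tangent to the circle at that point, in line with the tangent condition stated above.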

In Figure 13 a high density region has been chosen on one of the half-cylinder’s 
vertical edges. The figure shows the number of intersection points from this region 
which appear on the visible rim, as the model is rotated through 360° in the horizontal 
plane. As should be the case, the graph indicates that the edge will remain on the 
visible rim while the viewing angle is changed through 90°. It will then reappear at 
the visible rim on the other side of the object after an absence of 90° and will remain 
there for a further 90°. The graph provides sufficient information to select a sequence 
of viewpoints (in the horizontal plane) suitable for a close-up view of the vertical 
edge. 

Fig. 13. Optimum Angles for High Resolution Viewpoints 

In the case where a region of low intersection point density has been found, the 
nominal direction for projecting and viewing a structured light pattern will coincide 
with the direction of a normal to the object surface at the centre of the region 
concerned. Hence the procedure described above can be employed and the resulting 
viewing direction is then simply rotated through 90 degrees. 

In Figure 14, a low density region has been chosen on the half-cylinder’s planar 
front surface. If the model is rotated through 360° in the horizontal plane as before, 
the graph shows the number of intersection points from this region which are visible 
when the viewing direction is normal to the surface. As should be the case, the graph 
indicates that there is a unique optimum angle which occurs when the viewing 
direction is normal to the half-cylinder’s planar front surface. The projection and 
viewing directions for the structured light facility should be chosen so that the angle 
between them is bisected by the nominal direction indicated by the graph. 
7 Discussion of Results 

Progress to date with the active vision system has enabled a wide range of models to 
be produced for assessment purposes. The accuracy achieved suggests that the system 
would be suitable for computer graphics applications and that it may be suitable for 
reverse engineering applications. The occluding contour technique employed does not 
place a limit, a priori, on resolution and further work will be undertaken to define the 
level of accuracy which is achievable. At present, the system is clearly only capable 
of modelling convex surfaces and the proposed method of refining models by using 
higher resolution close range images has not yet been fully implemented. However, it 
has been shown that the projected contour intersection points can be used as a basis 
for automatically identifying surface regions for further modelling and also for 
suggesting the viewpoints which should be employed. Hence viewpoints which place 
surface features and details on the visible rim for close range viewing can be 
computed and potentially concave regions can be identified and a nominal direction 
for applying structured light obtained. Further work is required, however, to fully 
develop this technique, including removal of the constraint which restricts viewing 
directions to the horizontal plane. 
Fig. 14. Optimum (Nominal) Angle for Projecting and Viewing Structured Light. 



An objective of the current work was to develop a low cost system and no 
difficulties have been experienced with the modified robot, custom-built turntable and 
standard computing hardware employed. However, it is evident that the basic CCD 
camera incorporated in the system will limit future development and will prevent the 
system from being used for reverse engineering purposes. At present image resolution 
can only be varied within the range of the camera’s limited depth of field and initial 
results suggest that calibration accuracy deteriorates as the camera is moved towards 
the object. A more versatile camera will therefore be substituted and calibration 
repeated in advance of further development. The general principles established and 
techniques devised during the first phase of the project will, however, be retained 
since it is believed that the active vision system described in the paper can be the 
basis of a practical automated 3 dimensional modelling tool. 
Abstract. Recent human vision research [1] suggests modelling 
pre-attentive texture segmentation by taking a set of feature samples from a 
local region on each side of a hypothesized edge, and then performing 
standard statistical tests to determine if the two samples differ significantly 
in their mean or variance. If the difference is significant at a specified level 
of confidence, a human observer will tend to pre-attentively see a texture 
edge at that location. I present an algorithm based upon these results, with a 
well specified decision stage and intuitive, easily fit parameters. Previous 
models of pre-attentive texture segmentation have poorly specified decision 
stages, more unknown free parameters, and in some cases incorrectly model 
human performance. The algorithm uses heuristics for guessing the 
orientation of a texture edge at a given location, thus improving 
computational efficiency by performing the statistical tests at only one 
orientation for each spatial location. 



1 Pre-attentive Texture Segmentation 

Pre-attentive texture segmentation refers to the phenomenon in human vision in 
which two regions of texture quickly (i.e. in less than 250 ms), and effortlessly 
segregate. Observers may perceive a boundary or edge between the two regions. 

In computer vision, we would like to find semantically meaningful boundaries 
between different textures. One way of estimating these boundaries is to find 
boundaries that would be found by a human observer. The boundaries thus defined 
should be sufficient for most computer vision applications. Whether a human 
observer can distinguish two textures depends upon whether the discrimination is pre- 
attentive or attentive. The experimental literature tells us far more about pre-attentive 
segmentation than attentive discrimination. 

Researchers have suggested both feature- and filter-based models of pre-attentive 
texture segmentation. Many of the feature-based models have been statistical in 
nature. Julesz [2] suggested that pre-attentive segmentation is determined by 
differences in the 2nd-order statistics of the texture, or differences in the 1st-order statistics 
of "textons" such as line terminators and corners [3]. Beck, Prazdny, & Rosenfeld [4] 
suggested that texture segmentation is based upon differences in the first-order 
statistics of stimulus features such as orientation, size, and contrast. However, these 
theories do not indicate how these differences might be quantified, or what properties 
of the statistics might be used. Furthermore, such models have not typically been 
implemented such that they could be tested on actual images. 

Filter-based models [e.g. 5, 6, 7, 8] have suggested that texture segmentation is 
determined by the responses of spatial-frequency channels, where the channels contain 
both linear filtering mechanisms and various non-linearities. Malik & Perona’s model 
[7] provides the most developed example of this type of model. It involves linear 
bandpass filtering, followed by half-wave rectification, non-linear inhibition and 
excitation among channels and among neighboring spatial locations, filtering with 
large-scale Gaussian first derivative filters, and a decision based upon the maximum 
response from the final filtering stage. 

These models often contain many unknown parameters. What weights specify the 
inhibition and excitation among the different filter responses? What scale should one 
use for the Gaussian 1st-derivative filters? Perhaps most importantly, such models 
often contain an arbitrary, unspecified, threshold for determining the existence of a 
perceived edge between two textures. Many of these filter-based models are 
notoriously vague about the final decision stage. Furthermore, such models don’t 
give us much insight into which textures will segment, since the comparison carried 
out by the model is often obscured by the details of the filtering, non-linearities, and 
image-based decision stage. What is the meaning of the texture gradient computed in 
the penultimate stage of the Malik & Perona model? 

This paper describes a working texture segmentation algorithm that mimics human 
pre-attentive texture segmentation. Section 2 reviews recent human vision 
experiments [1], which were aimed at studying what first-order statistics determine 
texture segmentation. These results suggest modelling pre-attentive texture 
segmentation by standard statistical tests for a difference in mean and standard 
deviation of various features such as orientation and contrast. Section 3 reviews 
previous models of pre-attentive texture segmentation in light of these results, and 
discusses the relationship to other edge detection and image segmentation algorithms. 
Section 4 presents a biologically plausible, filter-based algorithm based upon the 
experimental results in Section 2 and those of Kingdom & Keeble [9]. Section 5 
presents results of this algorithm on artificial and natural images. 

2 Recent Experimental Results in Pre-attentive Texture 
Segmentation 

In [1], I studied segmentation of orientation-defined textures such as those shown in 
Figure 1. In each of the three experiments, observers viewed each texture pair for 250 
ms, and the task was to indicate whether the boundary between the two textures fell to 
the left or right of the center of the display. 

If, as Beck et al. [4] suggested, texture segmentation of orientation-defined textures is 
based upon differences in the 1st-order statistics of orientation, to what 1st-order 
statistics does this refer? If the difference in mean orientation is the crucial quantity, 
two textures should segment if this difference lies above a certain threshold, 
independent of other properties of the orientation distributions. A more plausible 
possibility is that the determining factor is the significance of the difference in mean 
orientations. The significance of the difference takes into account the variability of 
the textures, so that two homogeneous textures with means differing by 30 degrees 
may segment, while two heterogeneous textures with the same difference in mean may 
not. Perhaps observers can also segment two textures that differ only in their 
variability. Other parameters of the distribution might also be relevant, such as the 
(a) (b) (c) (d) 




(e) (f) (g) (h) 

Figure 1: Three orientation-defined textures (a, b, c, d) and the significant edges found by 
our algorithm (e, f, g, h). The “strength” gives the average value of the test statistic at all 
locations for which the edge was significant. For (e, f, g), compare this with 2.04, the 
82% threshold for a difference in mean orientation. For (h), compare the strength with 
2.5, the 82% threshold for a difference in orientation variability. 

skew or kurtosis. Alternatively, observers might be able to segment two textures 
given any sufficiently large difference in their first-order statistics. 

The first experiment asked observers to segment two textures that differed only in 
their mean orientation. Each texture had orientations drawn from a wrapped normal 
distribution [10]. The experiment determined the threshold difference in mean 
orientation, at which observers can correctly localize the texture boundary 82% of the time, 
for 4 different values of the orientation standard deviation. Figure 2a shows the 
results. Clearly observers can segment two textures differing only in their mean 
orientation. Furthermore, the difference in mean required to perform the segmentation 
task depends upon the standard deviation. 

The second experiment determined whether or not observers could pre-attentively 
segment textures that differed only in the variance of their orientation distributions. 
For two possible baseline standard deviations, the experiment measured the threshold 
increment in standard deviation at which observers could correctly localize the texture 
boundary 82% of the time. Observers could segment textures differing only in their 
variance, and Figure 2b shows the thresholds found. The difference in variance 
required depends upon the baseline standard deviation. 

The third experiment tested segmentation of a unimodal wrapped-normal distribution 
from a discrete, bimodal distribution with the same mean orientation and variance. 
This experiment measured percent correct performance, for 4 possible spacings of the 
modes of the bimodal distribution. The results are shown in Figure 2c. All observers 
Figure 2: Results of psychophysical experiments and modelling. Symbols plus error bars 
indicate data; curves indicate the fit of the model to that data. (See text.) In (a) and (b) data 
for the 3 observers and best fit curves have been shifted horizontally to facilitate viewing. 

performed well below the 82% correct required in Experiments 1 and 2, with only 
RER (the author) performing significantly above chance. These results do not rule 
out the possibility that observers may segment textures differing in statistics other 
than their mean and variance. However, the inability to segment textures that differ so 
greatly suggests that observers do not make use of the full first-order statistics in 
performing this task. 

Figure 3: Diagram of our computational model of texture segmentation. The diagram labels the internal noise (standard deviation s). 

These results suggest a model of pre-attentive texture segmentation of 
orientation-defined textures. Figure 3 depicts the stages of this model. The observer first extracts 
noisy estimates of orientation, with the internal noise distributed according to a 
wrapped normal distribution, with some standard deviation s. The observer then 
collects n orientation estimates from each side of a hypothesized edge. If the two sets 
of samples differ significantly, at the α=0.82 confidence level, in either their mean 
orientation or variance, then the observer sees a boundary. As implemented, this 
model uses the standard Watson-Williams test for a significant difference in mean 
orientation, and the Watson-Williams and Mardia test for a significant difference in 
orientation variance [10]. Section 4 discusses these tests in more detail. 

This model, then, has two free parameters: the internal noise standard deviation, s, 
and the number of samples, n. The curves in Figure 2a-b show the fit of this model to 
the experimental data, and the legend indicates the best-fit values of these parameters. 
For any given observer, this model provides a good fit of the thresholds for 
segmentation both of textures differing only in mean orientation, and textures differing 
only in orientation variance. Furthermore, the fit to the data of each of the three 
observers yields roughly the same value for the number of samples, n, suggesting that 
the size of the integration region may not vary much across observers. In modelling 
experimental results on homogeneous textures such as those used in these 
experiments, the n samples may be taken from anywhere in the texture. However, 
previous work suggests that the visual system performs local texture segmentation 
computations, with samples taken from regions adjacent to the hypothesized edge [see, 
e.g., 8, 11]. The fit to the data of the two less experienced subjects, DHM and VG, 
yields roughly the same value for the internal noise parameter, s. The fit to the data 
of experienced observer RER yields a lower value for this parameter, consistent with 
evidence that learning, for low-level perceptual tasks such as this one, may be 
mediated by a reduction of internal noise [12]. 

It appears, then, that for these orientation-defined textures, a good model for 
segmentation extracts noisy orientation estimates, then tests for a significant 
difference between the distributions of estimates using standard parametric statistical 
tests. For the purposes of this paper, I assume that a similar model describes 
segmentation based upon features besides orientation, such as contrast, color, etc. As 
of yet there is little experimental evidence addressing this point. Section 4 presents a 
texture segmentation algorithm based upon this model, using orientation and contrast 
as the texture features. 

3 Previous Work: Texture Segmentation Models and Algorithms 

To summarize the experimental results discussed in the previous section, a texture 
segmentation algorithm should be able to segment textures differing in mean 
orientation or in orientation variability. The segmentation difficulty should increase 
with increasing variability, in a way described by standard statistical tests. The 
segmentation algorithm should not find an edge when two textures have the same 
mean and variance, yet one is bimodally distributed and the other is unimodal. 

Most filter-based models of pre-attentive texture segmentation will find an edge 
when the two textures differ in their mean orientation. Though these algorithms do 
not explicitly perform a statistical test for the difference in mean orientation, many of 
them will replicate the findings that edges become weaker as the variance increases 
[e.g. 7, 8]. It remains to be seen whether performance will degrade in a way that 
matches the experimental findings. If it does not, then these algorithms require more 
than one threshold — the threshold would need to vary with orientation variance. 
Furthermore, my suggested model has two intuitive parameters, s and n, easily 
determined by a fit to the data. To the extent that previous models have implicitly 
performed a sort of statistical test, it may prove difficult to fit such models to the data. 
Space concerns prohibit our displaying results from Bergen/Landy [8] and 
Malik/Perona [7]. However, in our experience, as the variability of the texture 
increases, both algorithms find a large number of spurious edges, and Bergen/Landy 
may generate spurious edges at a finer scale than the true edge. Both models fail to 
find an edge when the textures differ only in variance. 

A number of standard clustering techniques will cluster in such a way as to 
maximize the difference in mean relative to the variability of each cluster, or some 
other standard statistical measure [13, 14, 15]. However, clustering techniques require 
that the user specify the number of clusters. My edge detection algorithm will test for 
edges without knowledge of the number of edges. 

Ruzon & Tomasi [16] perform color edge detection using the Earth Mover’s Distance 
(EMD) as a measure of the difference between two distributions of color samples. 
This work uses a more complicated statistical test for an edge, like the algorithm 
presented here; this paper does not analyze the relationship between their EMD and 
standard statistical tests. Voorhees & Poggio [17] also use a more complicated 
distance measure between two textures, extracting blobs from the textures, then using 
a non-parametric statistical test to compare the textures. However, not all textures 
lend themselves to easy blob extraction. Furthermore, the experimental results 
discussed in Section 2 show that the non-parametric test in [17] will find edges where 
human observers do not pre-attentively perceive them [1], a criticism that also holds 
for the measure of [14]. 

At each location, Ruzon & Tomasi test for an edge in a number of different 
directions. This requires a great deal of computation. Section 4.5 presents heuristics 
for testing for an edge at only one orientation. 

Elder [18] and Marimont & Rubner [19], working in standard, luminance-based edge 
detection, perform a statistical test for whether an edge exists at a given location, but 
they use a global measure of the variability, and thus will yield unexpected results 
when the variance changes over the image. 

Some of the closest previous work comes from Fesharki & Hellestrand [20], who 
perform edge detection by using a Student’s t-test to test for a significant difference in 
mean luminance. Similarly, Weber & Malik [21] test for motion boundaries, by 
using standard parametric statistical tests. 

4 The Texture Segmentation Algorithm 

4.1 Statistical Tests for a Significant Difference in Mean or Spread 

My texture segmentation algorithm declares the presence of an edge if certain basic 
texture features differ significantly in their mean or spread. For the purposes of this 
paper, the algorithm extracts the features of orientation and contrast. This subsection 
describes the statistical tests used by the algorithm for these two kinds of features. 
Testing for a Difference in Mean Orientation. Given two samples of 
orientation estimates, the Watson-Williams test [10] indicates whether the mean 
orientations of the two samples differ significantly. Assume two independent 
random samples of the same size, n, and denote them $\theta_1, \ldots, \theta_n$ and 
$\psi_1, \ldots, \psi_n$. The Watson-Williams test assumes two samples drawn from 
von Mises distributions with the same concentration parameter, $\kappa$. The von Mises 
distribution for directional data has many of the nice features of the normal 
distribution for linear data. It is quite similar to the wrapped normal distribution 
used in Section 2, and for many purposes they may be considered equivalent [10]. 
The higher the concentration parameter, the more concentrated the distribution 
about the mean orientation, i.e. the lower the spread of the distribution. The 
Watson-Williams test is ideally suited to $\kappa$ larger than 2, which is roughly 
equivalent to an angular standard deviation of less than 22 degrees. 

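We first compute the components of the mean resultant vectors: 

$$C_1 = \frac{1}{n}\sum_{i=1}^{n}\cos(2\theta_i), \quad S_1 = \frac{1}{n}\sum_{i=1}^{n}\sin(2\theta_i), \quad C_2 = \frac{1}{n}\sum_{i=1}^{n}\cos(2\psi_i), \quad S_2 = \frac{1}{n}\sum_{i=1}^{n}\sin(2\psi_i)$$

where $\theta_i$ and $\psi_i$ are the orientation estimates in the first and second samples. 
From this, we compute the length of the jth mean resultant vector: 

$$R_j = \sqrt{C_j^2 + S_j^2}$$

We also compute the length of the resultant vector for the combined data sample: 

$$R = \sqrt{C^2 + S^2}, \quad \text{where } C = (C_1 + C_2)/2 \text{ and } S = (S_1 + S_2)/2$$

The test statistic is 

$$F_\mu = \left(1 + \frac{3}{8\hat{\kappa}}\right) \frac{2(n-1)(R_1 + R_2 - 2R)}{2 - R_1 - R_2} \qquad (1)$$

where $\hat{\kappa}$ is the value of the concentration parameter estimated from the two samples. 

Under the null hypothesis of no edge, $F_\mu$ is approximately distributed according to 
an F distribution, tabulated in standard statistics books. If $F_\mu$ is larger than the 
tabulated value for significance level $\alpha$ and degrees of freedom (1, 2n-2), the difference 
in mean orientation is significant at level $\alpha$. Throughout this paper, $\alpha = 0.82$. 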
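As a minimal sketch, the mean-orientation statistic can be computed with the Python standard library alone. It uses the standard two-sample Watson-Williams statistic written in terms of mean resultant lengths of the doubled angles. Two things are assumptions not taken from the paper: the concentration parameter is estimated from the average of the two mean resultant lengths using a commonly used piecewise approximation, and all function names are hypothetical. Orientations are in radians, and the comparison with the tabulated F value is left to the caller.

```python
import math


def mean_resultant(angles):
    """Mean resultant vector (C, S) and its length R for doubled orientations."""
    n = len(angles)
    c = sum(math.cos(2.0 * a) for a in angles) / n
    s = sum(math.sin(2.0 * a) for a in angles) / n
    return c, s, math.hypot(c, s)


def estimate_kappa(r_bar):
    """Approximate von Mises concentration from a mean resultant length (assumed method)."""
    if r_bar < 0.53:
        return 2.0 * r_bar + r_bar ** 3 + 5.0 * r_bar ** 5 / 6.0
    if r_bar < 0.85:
        return -0.4 + 1.39 * r_bar + 0.43 / (1.0 - r_bar)
    return 1.0 / (r_bar ** 3 - 4.0 * r_bar ** 2 + 3.0 * r_bar)


def watson_williams_f(theta, psi):
    """F statistic for a difference in mean orientation between two equal-size samples."""
    n = len(theta)
    if len(psi) != n:
        raise ValueError("samples must have the same size")
    c1, s1, r1 = mean_resultant(theta)
    c2, s2, r2 = mean_resultant(psi)
    # Mean resultant length of the combined sample.
    r = math.hypot((c1 + c2) / 2.0, (s1 + s2) / 2.0)
    kappa = estimate_kappa((r1 + r2) / 2.0)
    correction = 1.0 + 3.0 / (8.0 * kappa)
    # Undefined when both samples have no spread at all (r1 = r2 = 1).
    return correction * 2.0 * (n - 1) * (r1 + r2 - 2.0 * r) / (2.0 - r1 - r2)
```

The returned value is compared with the tabulated F value for degrees of freedom (1, 2n-2). Identical samples give a statistic of zero, and the statistic grows as the mean orientations separate relative to the spread within each sample.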

Testing for a Difference in Orientation Variability. The Watson & 
Williams test, modified by Mardia [see 10], tests for a difference in concentration 
parameter. Assuming two samples, both of size n, drawn from von Mises 
distributions, with the mean resultant length of the combined sample greater than 0.70 
(equivalently, an angular deviation of less than 22 degrees), the test statistic is: 

$$F_\kappa = \frac{1 - R_1}{1 - R_2} \qquad (2)$$

or $1/F_\kappa$, whichever is >1, where $R_1$ and $R_2$ are the mean resultant lengths of the 
two samples. Under the null hypothesis that the two samples have the 
same concentration, $F_\kappa$ is approximately distributed according to an F distribution. To 
test for an edge of a given significance, $\alpha$, one compares $F_\kappa$ with the value in the $\alpha/2$ 
F distribution table, with degrees of freedom (n-1, n-1). If the statistic is larger than 
the value from the table, the difference in concentration parameter is statistically 
significant. 

Testing for a Difference in Mean Contrast. The standard Student’s t-test 
tests for a significant difference in mean for linear variables such as contrast. 
Assuming two samples, both of size n, drawn from normal distributions with 
unknown but equal variance, the test statistic is 

$$t = \frac{(\bar{x}_1 - \bar{x}_2)\sqrt{n}}{\sqrt{s_1^2 + s_2^2}} \qquad (3)$$

where $\bar{x}_j$ and $s_j$ are the jth sample mean and standard deviation, respectively. Under 
the null hypothesis that the two samples are drawn from the same distribution, this 
statistic is distributed according to a Student’s t distribution, tabulated in standard 
statistics books. If the absolute value of t is larger than the corresponding critical value 
from these tables, the difference in mean is significant at significance level $\alpha$. 

Testing for a Difference in Contrast Variance. To test for a significant 
difference in variance, for a linear variable such as contrast, one uses a standard F test. 
Again assuming two samples, each of size n, drawn from normal distributions, the 
test statistic is 

$$F_\sigma = \frac{s_1^2}{s_2^2}$$

or $1/F_\sigma$, whichever is greater than 1. Under the null hypothesis that the two samples 
are drawn from distributions with the same variance, this statistic is once again 
distributed according to an F distribution. For a given significance, $\alpha$, the difference 
in variance is significant if $F_\sigma$ is larger than the value in the statistics table for this 
significance, with degrees of freedom $(n-1, n-1)$. 
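
A minimal sketch of this variance-ratio test follows. It assumes the standard F statistic (the ratio of the two sample variances, inverted if needed so that it is at least 1); the name `variance_ratio` and the sample values are illustrative only.

```python
from statistics import variance


def variance_ratio(sample1, sample2):
    """Ratio of sample variances, inverted if necessary so the result is >= 1."""
    f = variance(sample1) / variance(sample2)  # assumes both variances are nonzero
    return f if f >= 1.0 else 1.0 / f


# Hypothetical contrast samples: the second is twice as spread out as the first.
a = [1.0, 2.0, 3.0, 4.0, 5.0]
b = [2.0, 4.0, 6.0, 8.0, 10.0]
f_stat = variance_ratio(a, b)
```

The result is compared with the tabulated F value for $(n-1, n-1)$ degrees of freedom, exactly as in the concentration test.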

General Comments. Note that the two tests for an orientation edge may also be 
used for any texture feature that is circularly distributed, i.e. for which a value k is 
equivalent to a value k mod m, for some m. The two tests for a contrast edge may be 
used for any one-dimensional, linearly distributed texture feature. Similar tests exist 
for multi-dimensional data. 

Before going on to describe the rest of the texture segmentation algorithm, consider 
what processing occurs in each of these four statistical tests. In each case, computing 
the statistic involves first taking the average of one or more functions of the feature 
estimates. For example, the orientation tests require first computing the mean 
resultant vectors, by taking the average of the cosine and sine of twice the orientation. 
In all 4 cases, the final test statistic involves some function of these averages. This 
suggests that an algorithm that computes these test statistics will involve first (1) 
computing a set of feature images, then (2) integrating some function(s) of these 
images over some integration region, and finally (3) computing the test statistics and 
testing for the presence of an edge. These stages correspond to those of the model 
depicted in Figure 3. Additional details of these stages are given below. 
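
To make the "average, then apply a function" structure concrete, here is a small sketch of the averaging step for orientation: the mean resultant vector is obtained by averaging the cosine and sine of twice the orientation. The function name and the example angles are assumptions made for illustration.

```python
import math


def mean_resultant(orientations_deg):
    """Return (C, S, R): the means of cos(2*theta) and sin(2*theta), and the
    length R of the mean resultant vector, for orientations in degrees."""
    n = len(orientations_deg)
    c = sum(math.cos(2.0 * math.radians(t)) for t in orientations_deg) / n
    s = sum(math.sin(2.0 * math.radians(t)) for t in orientations_deg) / n
    return c, s, math.hypot(c, s)


# A tightly clustered sample has R close to 1; a sample spread evenly over
# all orientations has R close to 0.
_, _, r_tight = mean_resultant([10.0, 12.0, 8.0, 11.0, 9.0])
_, _, r_spread = mean_resultant([0.0, 45.0, 90.0, 135.0])
```

Doubling the angle is what makes orientations of 0° and 180° count as the same value.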
4.2 Extract Texture Features 

For the purposes of this paper, my algorithm extracts the features of orientation and 
contrast. The use of contrast as a feature deserves some discussion. Several 
researchers [6, 7] have argued that segmentation of texture patterns such as those 
studied by Julesz [2] is determined by the "contrast" features extracted by center- 
surround filters such as Difference-of-Gaussian (DOG) filters. Others [3, 22] have 
suggested that one must extract more complicated features such as junctions and line 
endings. This remains an unresolved issue, but for many examples there is little 
difference between these two approaches. In addition, my algorithm assumes that 
segmentation of contrast-defined textures is determined by the significance of the 
differences between the mean and standard deviation of the contrast, as with 
orientation. This has yet to be demonstrated experimentally. Finally, the fit of the 
model to the orientation data from [1] yields estimates of the various model 
parameters, and the same would need to be done for other texture features, if one 
wishes to mimic human performance. My philosophy here is to extrapolate the 
orientation segmentation model to contrast textures, test it on various images, and 
look forward to additional psychophysics to resolve some of the above issues. 

The algorithm extracts the features of orientation and contrast using biologically 
plausible spatial-frequency channel operations. For orientation, this amounts to using 
steerable filtering [23] to extract estimates of orientation. Steerable filtering produces 
two images at each scale, representing $\cos(2\theta)$ and $\sin(2\theta)$, where $\theta$ is the orientation, 
locally. These are precisely the functions of orientation one needs in order to compute 
the mean resultant vectors required by the statistical tests in Eqns. 1 and 2. For 
contrast, we follow [7], filtering with first- and second-derivative DOG filters, 
followed by half-wave rectification, and a local max operation. The local max is 
intended to give a single, phase-invariant measure of local contrast. 

The algorithm extracts orientation and contrast at a number of different scales. For 
orientation, we use the oriented Gaussian pyramid from [8]. Thus the scales for 
orientation filtering differ by a factor of two. For contrast, we use the more densely 
sampled scales of [7]. 

The algorithm processes each scale and feature independently. As necessary, it adds 
noise to the feature estimates, so as to match the internal noise parameter in fits to 
human data. The orientation estimation procedure contains inherent noise equivalent 
to the internal noise for the two more naive observers in Figure 2, so the algorithm 
requires no added noise. For contrast, we have no fit to human data, and have 
experimented with a number of possible amounts of added noise. In the examples 
shown here, we use added noise with variance $\sigma^2 = 150$. 

4.3 Collect Samples of Feature Estimates and Compute the Relevant 
Statistics 

Next, the algorithm hypothesizes that an edge, with a particular orientation, exists at a 
particular location in the image. It then tests this hypothesis by running the various 
statistical tests described above. 

As mentioned in the discussion of these statistical tests, the first step in calculating 
each of the statistics is to compute averages of various functions of the feature values. 
Therefore, the step in the model in which the observer collects n feature samples from 
each side of a hypothesized edge is equivalent to integrating some functions of the 
feature values over a local integration region. The algorithm integrates using a 
Gaussian window. 
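
The following one-dimensional sketch illustrates what integrating with a Gaussian window means in practice: each output value is a Gaussian-weighted average of nearby feature values. The window radius, the value of sigma, and the border handling (renormalizing over the samples that fall inside the signal) are assumptions of this sketch, not details given in the text.

```python
import math


def gaussian_weights(radius, sigma):
    """Normalized Gaussian weights for offsets -radius..radius."""
    w = [math.exp(-0.5 * (k / sigma) ** 2) for k in range(-radius, radius + 1)]
    total = sum(w)
    return [v / total for v in w]


def gaussian_local_mean(values, center, radius, sigma):
    """Gaussian-weighted mean of values around index center.
    Samples outside the signal are skipped and the weights renormalized."""
    num = den = 0.0
    for k, w in zip(range(-radius, radius + 1), gaussian_weights(radius, sigma)):
        i = center + k
        if 0 <= i < len(values):
            num += w * values[i]
            den += w
    return num / den


# A step in a hypothetical feature signal: 0 on the left, 1 on the right.
signal = [0.0] * 10 + [1.0] * 10
far_left = gaussian_local_mean(signal, 3, 3, 1.5)
far_right = gaussian_local_mean(signal, 16, 3, 1.5)
at_step = gaussian_local_mean(signal, 10, 3, 1.5)
```

Applying the same window to functions of the feature (for example its square, or the cosine and sine of twice the orientation) yields the averages that the four statistics require.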

Previous work by Kingdom & Keeble [9] suggests that, for orientation, the size of 
the integration region is a constant multiple of the width of the oriented lines that 
make up the texture. Alternatively, one may think of the region size as being a 
constant multiple of the support of the filters used to extract orientation estimates. 
For finer features, the region is proportionally smaller than for coarser features. Once 
again, the algorithm described here assumes that the same principle holds for features 
other than orientation. 

The fit to the experimental data in Figure 2 indicated that human observers seemed 
to collect orientation estimates from n=9 elements on each side of the hypothesized 
edge. Based upon the width and spacing of the line elements used in the experimental 
displays, this implies that human observers collect samples from a circular region of 
diameter approximately 5 times the support of their oriented filters. The algorithm 
uses an integration region for contrast features of a similar size relative to the center- 
surround filters used to extract contrast. 

4.4 Combine Results into a Single Edge Map 

The results of the previous steps give us an image for each scale, feature type, and 
statistical test (mean and spread). Each image indicates, for each location, the presence 
or absence of a significant edge of the given type. The value of the statistic gives an 
indication of the strength of the edge. One could stop at this stage, or there are 
various things one could do to clean up these edge maps and combine them into a 
single map. "False" variance edges tend to occur next to edges due to a difference in 
mean, since a region which spans the edge will have a higher variance than 
neighboring regions which include only one texture. Therefore, the algorithm inhibits 
variance edges near significant difference-in-mean edges. It combines edge images 
across scales using a straightforward "or" operation. If an edge is significant at any 
scale, it is significant, regardless of what happens at any other scale. Finally, texture 
edges tend to be more poorly localized than luminance edges, due to their statistical 
nature. Texture edges, in the examples in this paper, are typically statistically 
significant in a band about 2 texels wide. One can "thin" edges by finding the 
maximum of the test statistic in the direction of the edge, as done for the results 
presented here. 
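
The combination rules above can be sketched as follows, for edge maps stored as nested lists of booleans. The inhibition radius, and the choice to apply inhibition within each scale before the "or" across scales, are assumptions of this sketch; the text does not specify them.

```python
def combine_edge_maps(mean_maps, var_maps, radius=1):
    """mean_maps, var_maps: one 2D boolean grid per scale, all the same shape.
    Variance edges within `radius` pixels of a mean edge are suppressed,
    then all surviving edges are OR-ed across scales."""
    rows, cols = len(mean_maps[0]), len(mean_maps[0][0])
    combined = [[False] * cols for _ in range(rows)]
    for mean_map, var_map in zip(mean_maps, var_maps):
        for r in range(rows):
            for c in range(cols):
                near_mean_edge = any(
                    mean_map[rr][cc]
                    for rr in range(max(0, r - radius), min(rows, r + radius + 1))
                    for cc in range(max(0, c - radius), min(cols, c + radius + 1))
                )
                keep_variance_edge = var_map[r][c] and not near_mean_edge
                if mean_map[r][c] or keep_variance_edge:
                    combined[r][c] = True
    return combined


# One scale, one row: a mean edge in the middle flanked by variance responses.
mean_edges = [[[False, False, True, False, False]]]
var_edges = [[[True, True, False, True, True]]]
result = combine_edge_maps(mean_edges, var_edges, radius=1)
```

In this toy case the variance responses adjacent to the mean edge are suppressed, while those farther away survive.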

4.5 How to Hypothesize an Edge 

Edge detection methods that involve taking the gradient of the mean of some feature 
value [e.g. 24, 7, and many others] have the advantage that they are "steerable" [23]. 
This means that such methods can take only a horizontal and a vertical derivative, and 
infer the direction in which the derivative - and thus the edge strength - is maximized. 
Edge detection methods that involve more complex comparisons of the features on 
each side of the edge [e.g. 16] typically require that one check each possible 
orientation for the presence of an edge. This process is time consuming. Thus in this 
subsection I ask whether one can make a good guess as to the most likely orientation 
of an edge at each location, and test only that edge orientation. 

Figure 4: Relationship between mean resultant vectors and heuristics for guessing edge 
direction (see text). 

This subsection presents a set of heuristics for guessing the edge orientation, one 
for each statistical test s. For each s, one would like to test only one orientation for 
an edge at each location, yet find edges at the same locations as if one had tested all 
possible orientations. Therefore, for each s, one desires a steerable function, $f_s$, of the 
feature values, such that if there exists an orientation such that test s indicates a 
significant edge, test s will indicate a significant edge at the orientation that 
maximizes $f_s$. If these conditions hold, one may easily find the orientation, $\theta_{\max}$, that 
maximizes $f_s$, and be assured that if test s would find a significant edge at that 
location, it will find one at orientation $\theta_{\max}$. 

For each test, s, I present a function $f_s$ that is the Gaussian 1st-derivative of some 
function of the feature values. The maximizing orientation of such $f_s$ is given by the 
direction of the gradient. The heuristics presented produced the desired behaviour over 
99% of the time in Monte Carlo simulations. Thus these heuristics provide a reliable 
way of avoiding the computational complexity of testing each possible edge 
orientation at each spatial location. 

Heuristic for a Mean-Orientation Edge. As mentioned above, the orientation 
estimation stage returns two images for each scale, representing $\cos(2\theta)$ and $\sin(2\theta)$, 
where $\theta$ is the local orientation estimate at that scale. First, rotate all orientation 
estimates by angle $\beta$, such that the mean resultant vector for the combined sample has 
an orientation of 0°. This generates the image: 

$$\sin(2\theta - \beta) = \sin(2\theta)\cos(\beta) - \cos(2\theta)\sin(\beta)$$

Next, take the Gaussian 1st-derivative of this image in the horizontal and vertical 
directions. The direction of the gradient provides a guess for the edge direction. The 
intuition follows: 

Figure 4a shows example resultant vectors, rotated to the canonical position, for the 
case in which the two distributions have the same concentration but different mean 
orientations. The Gaussian 1st-derivative computes essentially the difference between 
the y-components of the two resultant vectors. The length of the resultant vector 
gives a measure of the concentration of the distribution. The orientation of the 
resultant vectors indicates an estimate of the mean orientations of the underlying 
distributions. For given lengths of the two mean resultant vectors, $\bar{R}_1$ and $\bar{R}_2$, the 
vertical distance between their tips will be larger the larger the difference in angle 
between them, and for a given difference in angle, the vertical distance will be smaller 
the shorter their lengths. The same is true of the test statistic, $F_\mu$, thus the intuition 
that this heuristic should give us a reasonable guess for the direction of the edge. 
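
The rotation step of this heuristic can be sketched in a few lines. The example below is one-dimensional and noise-free: it rotates two hypothetical samples (orientations of 10° and 35°) by the angle of the combined mean resultant vector, then compares the mean of the rotated sine on the two sides. The function names and sample values are assumptions for illustration; in the algorithm itself the comparison is made by a Gaussian 1st-derivative over the image.

```python
import math


def rotation_angle(thetas):
    """Angle beta of the mean resultant vector of the combined sample,
    measured in the doubled-angle space (radians)."""
    c = sum(math.cos(2.0 * t) for t in thetas)
    s = sum(math.sin(2.0 * t) for t in thetas)
    return math.atan2(s, c)


def rotated_sine(thetas, beta):
    """sin(2*theta - beta), expanded as sin(2t)cos(b) - cos(2t)sin(b)."""
    return [math.sin(2.0 * t) * math.cos(beta) - math.cos(2.0 * t) * math.sin(beta)
            for t in thetas]


side1 = [math.radians(10.0)] * 9   # hypothetical samples left of the edge
side2 = [math.radians(35.0)] * 9   # hypothetical samples right of the edge
beta = rotation_angle(side1 + side2)
y1 = sum(rotated_sine(side1, beta)) / len(side1)
y2 = sum(rotated_sine(side2, beta)) / len(side2)
difference = y2 - y1
```

With a 25° difference in mean orientation and no variability, the difference between the two y-components equals 2 sin(25°), the vertical distance between the tips of two unit-length resultant vectors.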

Heuristic for an Orientation-Variability Edge. Again rotate the orientation 
estimates such that the resultant vector for the combined sample has an orientation of 
0°. This time, compute the image: 

$$\cos(2\theta - \beta) = \cos(2\theta)\cos(\beta) + \sin(2\theta)\sin(\beta)$$

Again, take Gaussian 1st-derivatives of this image. The direction of the gradient 
provides the guess for the edge direction. The intuition follows: 

Figure 4b shows example resultant vectors for the case in which the underlying 
distributions differ only in their concentration parameter. The horizontal distance 
between the tips of these resultant vectors gives a measure of the difference between 
the two concentration parameters. The test statistic (see Eqn. 2) takes a ratio of the 
resultant vector lengths, as opposed to a difference, but the difference serves well when 
it comes to guessing the direction of the edge. 

Heuristic for a Mean-Contrast Edge. Here we steer the Gaussian 1st-derivative 
of contrast. The direction of the gradient, which gives the direction with the 
largest change in mean contrast, provides the guess for the edge direction. 

In this case, one can make a stronger statement: The direction of the gradient of 
mean contrast indicates the direction in which the test statistic reaches its maximum. 
Thus the direction of the gradient always provides the best guess for the edge direction. 
Recall that the test statistic for a difference in mean contrast is: 

$$t = \frac{(\bar{x}_1 - \bar{x}_2)\sqrt{n}}{\sqrt{s_1^2 + s_2^2}}$$

The gradient indicates the direction in which $(\bar{x}_1 - \bar{x}_2)$ is maximized. The only way 
that t could reach a maximum in a different direction is if that direction reduced the 
denominator more than the numerator. It is a simple matter of algebra to show that 
an edge direction with smaller $(\bar{x}_1 - \bar{x}_2)$ in fact also yields a larger $s_1^2 + s_2^2$, thus 
reducing the test statistic. Due to space concerns, we do not reproduce the proof here. 
The intuition follows: Suppose that in the direction of the gradient, $\bar{x}_1 > \bar{x}_2$. When 
the direction of the hypothesized edge changes, m samples from the first set transfer to 
the second, and vice versa. Since the difference between the two means decreases, the 
change in direction adds, on average, smaller elements to the 1st set, and takes away 
larger elements. But such a manipulation will increase the sum of the variances of the 
sets, since it adds small elements to the set with the larger mean, and vice versa. 

Heuristic for a Contrast-Variance Edge. First calculate the mean, $\mu$, of both 
sample sets. Then compute the image 

$$(x - \mu)^2$$

Again, steer the Gaussian 1st-derivative of this function, and the direction of its 
maximum gives us our guess for the edge direction. 

This heuristic estimates the variance of each sample set, and uses the difference 
between these variances as a measure of the strength of the edge in a given direction. 
The test statistic takes a ratio of the two variance estimates, as opposed to a difference, 
but the difference serves well when it comes to guessing the direction of the edge, and 
is easily steered. 

5 Results and Discussion 

All of the examples in this paper used the same version of the algorithm, with 
parameters set as described above. Figure 1 shows the results on four images much 
like those used in the experiments described in Section 2. The first three texture pairs 
differ in mean orientation by 25°. Low variability textures, as in Figure 1a, allow 
observers to localize the boundary well over 82% of the time. In Figure 1b, a larger 
orientation variance makes the difference in mean just above the 82% correct 
threshold. In Figure 1c, the increased variance makes the difference in mean well 
below threshold - observers would have great difficulty localizing this edge. 

Figures 1e-1g show the results of our algorithm on these three texture pairs. The 
results will often show a number of edges laid on top of each other, when edges are 
found at a number of different scales or by more than one of the 4 statistical tests. 
The algorithm correctly finds a strong boundary in the first image, a weaker one in the 
second image, and essentially no edge in the third. 

Figure 1d shows an image in which the two textures differ only in the spread of 
their orientations. This difference is above the thresholds found in Experiment 2, and 
it should be possible to see the boundary. Our algorithm finds the boundary, as 
shown in Figure 1h. 

Many texture segmentation experiments and theories have revolved around images 
like those in Figure 5. Malik & Perona [7] tested their algorithm on such images, 
and used it to predict the segmentability of these texture pairs. Their predictions 
agreed with the experimental data of Gurnsey & Browse [25]. (Malik & Perona also 
compared the results of their algorithm with those of Krose [26]. However, Krose 
studied a visual search task, in which observers looked for an “odd man out.” It is 
inappropriate to compare those results with results of segmentation of images such as 
those in Figure 5, for reasons given in [25].) The results of my algorithm, shown in 
Figure 5, agree with those of [25]. 

Figure 5: Four texture pairs from [7], with the edges found by our algorithm. The strength 
gives the average value of the mean-contrast test statistic for each edge, and should be 
compared with the threshold value, 1.42. Only mean-contrast edges were significant. 

Figure 6: Patchwork of natural textures (a), along with the contrast (b) and orientation (c) 
edges found by our algorithm. 

Figure 7: Results on a natural image. 

Figure 6a shows an array of real textures, stitched together. Note that the true 
boundaries between the textures are not straight. The algorithm does quite well at 
extracting contrast edges (Figure 6b), often closely following the boundaries. It 
performs less well near corners, where assumptions about the number of textures present 
in a region are violated. Few of the textures in this image have strong orientation, 
providing a challenge to finding orientation edges, yet we still see reasonable results, 
as shown in Figure 6c. The algorithm predicts no pre-attentive segmentation for 
texture pairs with no edge shown between them. Figure 7 shows a highly oriented 
natural image, for which the algorithm extracts strong orientation edges. 

This paper has presented an algorithm for mimicking pre-attentive texture 
segmentation. The algorithm bridges the gap between earlier statistical and filter- 
based models. It has essentially two intuitive key parameters, the internal noise in the 
feature estimates and the integration scale, and obtains values of those parameters from 
fits to experimental data in [1, 9]. Performing tests for an edge more complicated than 
a gradient has traditionally led to increased computational cost due to a need to test for 
an edge at multiple orientations. The heuristics presented here essentially allow the 
algorithm’s statistical tests to be “steered.” 

Abstract. We propose a new framework for calibrating parameters of 
energy functionals, as used in image analysis. The method learns 
parameters from a family of correct examples, given a probabilistic 
construct for generating wrong examples from correct ones. We 
introduce a measure of frustration to penalize cases in which wrong responses 
are preferred to correct ones, and we design a stochastic gradient 
algorithm which converges to parameters which minimize this measure of 
frustration. We also present a first set of experiments in this context, 
and introduce extensions to deal with data-dependent energies. 

keywords: Learning, variational method, parameter estimation, image 
reconstruction, Bayesian image models 



1 Description of the Method 

Many problems in computer vision are addressed through the minimization of 
a cost functional U. This function is typically defined on a large, finite set 
$\Omega$ (for example the set of pictures with fixed dimensions), and the minimizer 
of $x \mapsto U(x)$ is supposed to reconcile several properties which are generally 
antithetic. 

Indeed, the energy is usually designed as a combination of several terms, each 
of them corresponding to a precise property which must be satisfied by the 
optimal solution. As an example among many others, let us quote probably the most 
studied cost functional in computer vision, namely the Mumford/Shah energy 
(cf. 0), which is used to segment and smooth an observed picture. Expressed 
in a continuous setting, it is the combination of three terms: one which ensures 
that the smoothed picture x, defined on a set $D \subset \mathbb{R}^2$, is not too different from 
the observed one $\xi$, another which states that the derivative of the smoothed 
picture is small, except, possibly, on a discontinuity set A, and a last one which 
ensures that the discontinuity set has small length. These terms are weighted by 
parameters, yielding an energy function of the kind 

$$U(x) = \int_D (\xi(s) - x(s))^2\,ds + \alpha \int_{D \setminus A} |\nabla_s x|^2\,ds + \beta\,\mathcal{H}(A) \qquad (1)$$

D. Vernon (Ed.): ECCV 2000, LNCS 1843, pp. 212- 12^ 2000. 

(c) Springer-Verlag Berlin Heidelberg 2000 
where $\mathcal{H}$ is the Hausdorff measure of the discontinuity set. 

In this paper, we consider cost functionals of the kind 

$$U(x) = U_0(x) + \sum_{i=1}^{d} \theta_i U_i(x)$$

where the $\theta_i$ are positive parameters. Whatever vision task this functional is 
dedicated to (restoration, segmentation, edge detection, matching, pattern 
recognition, ...), it is acknowledged that variations in the values of the parameters 
have significant effects on the qualitative properties of the minimizer. Very often, 
these parameters are fixed by trial and error, while experimenting with the 
optimization algorithm. We here propose a systematic way of tuning them, based on a 
learning procedure. 

The method is reminiscent of the qualitative box estimation procedure which 
has been introduced by Azencott in . It relies on some a priori knowledge which 
is available to the designer. The basic information can be expressed under the 
statement: For some configurations x and y in $\Omega$, one should have $U(x) > U(y)$. 
In other terms, y is a “better” solution than x. 

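When this is known for a number of pairs of configurations, $\{(x_k, y_k),\ k = 1, \ldots, N\}$, 
we get a system of constraints which take the form, for $k = 1, \ldots, N$: 

$$U_0(y_k) - U_0(x_k) + \sum_{i=1}^{d} \theta_i \big(U_i(y_k) - U_i(x_k)\big) < 0$$

If we let $\theta = (\theta_1, \ldots, \theta_d)$, $\Delta_{ki} = U_i(y_k) - U_i(x_k)$, and $\Delta_k = (\Delta_{k1}, \ldots, \Delta_{kd})$, 
and write $\Delta_{0k} = U_0(y_k) - U_0(x_k)$, this can be written 

$$\Delta_{0k} + \langle \theta, \Delta_k \rangle < 0, \quad k = 1, \ldots, N,$$

$\langle \cdot, \cdot \rangle$ being the usual inner product on $\mathbb{R}^d$. 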

Solving such a system of linear inequalities can be performed by a standard 
simplex algorithm. However, when the system has no solution (which is likely to 
occur if there are many inequalities, and/or if they are deduced from the 
observation of noisy real data), it is difficult to infer from the simplex method which 
parameter should be selected. We thus define a new cost functional in the 
parameters, or measure of frustration, which is large when the inequalities are not 
satisfied: denote by $a^+$ the positive part of a real number a, and set 

$$F_0(\theta) = \sum_{k=1}^{N} \big[\Delta_{0k} + \langle \theta, \Delta_k \rangle\big]^+$$

It is practically more convenient to use a smooth approximation of this 
function, so that we let, for $\lambda > 0$, 

$$F_\lambda(\theta) = \frac{1}{2} \sum_{k=1}^{N} q_\lambda\big(\Delta_{0k} + \langle \theta, \Delta_k \rangle\big)$$

with $q_\lambda(a) = \lambda \log\big(e^{a/\lambda} + e^{-a/\lambda}\big) + a$. Given properly selected examples, the 
minimization of $F_\lambda$ is the core of our estimation procedure. We therefore study some 
related properties. 
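
As a numerical illustration, the sketch below evaluates the non-smooth measure of frustration and its smooth approximation on a small made-up set of constraints. It assumes the smooth version is one half of the sum of the $q_\lambda$ terms, which is the normalization under which it tends to the non-smooth version as $\lambda \to 0^+$. The function names and the data are hypothetical.

```python
import math


def q(a, lam):
    """q_lambda(a) = lam * log(exp(a/lam) + exp(-a/lam)) + a, computed as
    |a| + lam * log(1 + exp(-2|a|/lam)) + a to avoid overflow for small lam."""
    return abs(a) + lam * math.log1p(math.exp(-2.0 * abs(a) / lam)) + a


def inner(u, v):
    return sum(x * y for x, y in zip(u, v))


def frustration0(theta, delta0, deltas):
    """Sum over k of the positive part of delta0[k] + <theta, deltas[k]>."""
    return sum(max(0.0, d0 + inner(theta, dk)) for d0, dk in zip(delta0, deltas))


def frustration(theta, delta0, deltas, lam):
    """Smooth version: one half of the sum of q_lambda over the constraints."""
    return 0.5 * sum(q(d0 + inner(theta, dk), lam) for d0, dk in zip(delta0, deltas))


# Hypothetical data: three constraints, two parameters.
delta0 = [1.0, -2.0, 0.5]
deltas = [[1.0, 0.0], [0.0, 1.0], [-1.0, 1.0]]
theta = [0.5, 0.5]
f0 = frustration0(theta, delta0, deltas)
f_small = frustration(theta, delta0, deltas, 0.01)
f_large = frustration(theta, delta0, deltas, 1.0)
```

For small $\lambda$ the two values agree closely, while a larger $\lambda$ gives a smoother but larger value.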
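2 Properties of the Function $F_\lambda$ 

Proposition 1. For all $\lambda > 0$, $F_\lambda$ is a convex function of $\theta$. Moreover, 

$$\lim_{\lambda \to 0^+} F_\lambda(\theta) = F_0(\theta).$$

This is more or less obvious and left to the reader. Let us, however, write down 
the derivatives of $F_\lambda$, for $\lambda > 0$, since they will be used in the sequel (recall 
that the first derivative is a vector and the second derivative a $d \times d$ symmetric 
matrix). One has: 

$$F'_\lambda(\theta) = \frac{1}{2} \sum_{k=1}^{N} \left[1 + \tanh\left(\frac{\Delta_{0k} + \langle \theta, \Delta_k \rangle}{\lambda}\right)\right] \Delta_k \qquad (2)$$

$$F''_\lambda(\theta) = \frac{1}{2\lambda} \sum_{k=1}^{N} \left[1 - \tanh^2\left(\frac{\Delta_{0k} + \langle \theta, \Delta_k \rangle}{\lambda}\right)\right] \Delta_k\,{}^t\!\Delta_k \qquad (3)$$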



Denote by $\Sigma_\Delta$ the covariance matrix of the $\Delta_k$, namely 

$$\Sigma_\Delta = \frac{1}{N} \sum_{k=1}^{N} \Delta_k\,{}^t\!\Delta_k$$

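Proposition 2. The matrix $\Sigma_\Delta$ is positive definite if and only if, for all $\lambda > 0$, 
the function $F_\lambda$ is strictly convex, and if and only if, for some $\lambda > 0$, the function 
$F_\lambda$ is strictly convex. 

Proof. If, for some $\lambda > 0$, and for some $\theta$, $F''_\lambda(\theta)$ is not positive definite, there 
exists a nonzero vector $u \in \mathbb{R}^d$ such that ${}^t u\, F''_\lambda(\theta)\, u = 0$. But one has 

$${}^t u\, F''_\lambda(\theta)\, u = \frac{1}{2\lambda} \sum_{k=1}^{N} \left[1 - \tanh^2\left(\frac{\Delta_{0k} + \langle \theta, \Delta_k \rangle}{\lambda}\right)\right] \langle u, \Delta_k \rangle^2$$

and this expression can vanish only if, for all k, $\langle u, \Delta_k \rangle = 0$, but this implies 
that ${}^t u\, \Sigma_\Delta\, u = 0$ so that $\Sigma_\Delta$ cannot be definite. 

Conversely, if $\Sigma_\Delta$ is not positive definite, one shows similarly that there exists a 
nonzero u such that $\langle u, \Delta_k \rangle = 0$ for all k, but this implies that, for any $\lambda > 0$, for 
any $\theta$ and any $t \in \mathbb{R}$, $F_\lambda(\theta + tu) = F_\lambda(\theta)$ so that $F_\lambda$ cannot be strictly convex. 

Thus, the failure of strict convexity is equivalent to the existence of a fixed linear 
relation among $\Delta_{k1}, \ldots, \Delta_{kd}$. 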

We now address the question of the existence of a minimum of $F_\lambda$. We assume 
$\lambda > 0$ and strict convexity, i.e. $\Sigma_\Delta > 0$. The convex function $F_\lambda$ has no minimum 
if and only if it has a direction of recession, i.e. if and only if there exists a 
vector $u \in \mathbb{R}^d$ such that, for all $\theta$, $t \mapsto F_\lambda(\theta + tu)$ is decreasing. By studying 
the derivative of this function, we can show that, in order to have a direction of 
recession, there must exist some u such that $\langle \Delta_k, u \rangle \le 0$ for all k, with a strict 
inequality for some k in order to have strict convexity. If u provides a direction 
of recession, then $tu$ will be a solution of the original set of inequalities as soon 
as t is large enough. This is a very inconvenient feature, since, in particular, it 
will completely cancel out the role of $U_0$. Such a situation is in fact caused by a 
lack of information in the original set of examples $(x_k, y_k)$, $k = 1, \ldots, N$, in the 
sense that this set fails to provide situations in which the role of $U_0$ has some 
impact. 

3 Learning from Examples 

3.1 Objective Function from Small Variations 

We now provide a framework in which this simple technique can be applied when 
some examples of “correct configurations” are available. They may come, either 
from simulated, synthetic data, or from real data which have been processed by 
an expert. The idea is to generate random perturbations of the correct confi- 
gurations and to estimate the parameters so that the perturbed configurations 
have a higher energy than the correct ones. 

Let us first assume that a single configuration $y_0$ is provided. Our goal is 
thus to design the parameters so that $y_0$ will be, in some local sense, a 
minimizer of the energy. The key of the learning process is to define a process 
which generates random perturbations of a given configuration. This process of 
course depends on the application, and should provide a sufficiently large range 
of new configurations from the initial one. Formally, it will be associated to a 
transition probability $P(y_0, \cdot)$ on $\Omega$, which will produce variations of the correct 
configuration $y_0$. Assume this is done K times independently, and that a sample 
$x_1, \ldots, x_K$ has been drawn from this probability. From the fact that $y_0$ is a good 
configuration, we assume that, for all k, $U(y_0) - U(x_k) < 0$. Slightly changing 
the notation, define $\Delta(y_0, x)$ to be the vector composed of the $U_i(y_0) - U_i(x)$ 
for $i = 1, \ldots, d$ and $h(y_0, x) = U_0(y_0) - U_0(x)$. The previous method leads us to 
minimize 

$$F_\lambda^K(\theta) = \frac{1}{2} \sum_{k=1}^{K} q_\lambda\big[h(y_0, x_k) + \langle \theta, \Delta(y_0, x_k) \rangle\big]$$

Now, when K tends to infinity, the limit of $F_\lambda^K / K$ is almost surely given by 
(since the samples are drawn independently) 

$$F_\lambda(\theta) = \frac{1}{2} E_{y_0}\big\{ q_\lambda\big[h(y_0, \cdot) + \langle \theta, \Delta(y_0, \cdot) \rangle\big] \big\}$$

where $E_{y_0}$ is the expectation with respect to the probability $P(y_0, \cdot)$. This 
functional becomes our measure of frustration, which should be minimized in order 
to calibrate $\theta$. 

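Assume now that several examples are provided, under the form of a learning 
set $y_1, \ldots, y_N$. The new objective function is 

$$F_\lambda(\theta) = \frac{1}{2N} \sum_{j=1}^{N} E_{y_j}\big\{ q_\lambda\big[h(y_j, \cdot) + \langle \theta, \Delta(y_j, \cdot) \rangle\big] \big\}$$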
3.2 Minimizing $F_\lambda$ 

To simplify the notation, we restrict again to the case of a single example $y_0$. We 
still have the fact that, for any $\lambda$, the function $F_\lambda$ is convex, with first derivative 

$$F'_\lambda(\theta) = \frac{1}{2} E_{y_0}\left\{ \left[1 + \tanh\left(\frac{h(y_0, \cdot) + \langle \theta, \Delta(y_0, \cdot) \rangle}{\lambda}\right)\right] \Delta(y_0, \cdot) \right\} \qquad (4)$$

According to the discussion of Section 2, the transition probability P should, 
to avoid directions of recession, explore a sufficiently large neighborhood of $y_0$, 
to provide enough information on the variations of U. Because of this, it is likely 
that the gradient in (4) cannot be efficiently computed, either analytically or 
numerically. To minimize $F_\lambda$ in such a case, we use a stochastic gradient learning 
procedure, which we describe now: 



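Learning Procedure 

0. Start with some initial value $\theta_0$. 

1. At time n, $\theta_n$ being the current parameter, draw at random a sample $X^n$ 
from the transition probability $P(y_0, \cdot)$, and set 

$$\theta_{n+1} = \theta_n - \gamma_{n+1} \left[1 + \tanh\left(\frac{h(y_0, X^n) + \langle \theta_n, \Delta(y_0, X^n) \rangle}{\lambda}\right)\right] \Delta(y_0, X^n) \qquad (5)$$

where $(\gamma_n, n \ge 1)$ is a decreasing sequence of positive gains satisfying 
$\sum_n \gamma_n = +\infty$ and $\sum_n \gamma_n^2 < +\infty$. 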

Standard results in stochastic approximation (see 0, for example) show 
that, in the absence of a direction of recession, the sequence $(\theta_n)$ generated by 
this algorithm almost surely converges to the minimizer of $F_\lambda$. 

If there is more than one example $y_1, \ldots, y_N$, the previous algorithm simply 
has to be modified by taking, at each step, $y_0$ at random in the set $\{y_1, \ldots, y_N\}$. 
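
The learning procedure can be sketched as below. Everything specific here is an assumption made for illustration: the gain sequence $\gamma_n = 1/n$ (one choice that satisfies the two summability conditions), the function names, and the one-parameter toy perturbation model, in which $h = 0$ and $\Delta = \pm 1$ with equal probability, so that the minimizer is $\theta = 0$ and there is no direction of recession.

```python
import math
import random


def learn(sample_perturbation, theta0, lam, n_steps, rng):
    """Stochastic gradient iteration of the learning procedure.
    sample_perturbation(rng) must return a pair (h, delta) computed from one
    random perturbation of the correct configuration."""
    theta = list(theta0)
    for n in range(n_steps):
        h, delta = sample_perturbation(rng)
        gain = 1.0 / (n + 1)  # decreasing; sum diverges, sum of squares converges
        a = h + sum(t * d for t, d in zip(theta, delta))
        factor = 1.0 + math.tanh(a / lam)
        theta = [t - gain * factor * d for t, d in zip(theta, delta)]
    return theta


def toy_perturbation(rng):
    """Hypothetical stand-in for drawing from P(y0, .): h = 0, delta = +1 or -1."""
    return 0.0, [rng.choice((-1.0, 1.0))]


theta_hat = learn(toy_perturbation, [2.0], lam=1.0, n_steps=5000,
                  rng=random.Random(0))
```

In a real application `sample_perturbation` would perturb the correct configuration and evaluate each energy term on both configurations; only the update rule itself is taken from the procedure above.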

3.3 Remark 

Notice that, under its most general form, and when the perturbations explore a 
large set of configurations, there is very little chance that there exists a parameter 
set for which all the constraints are truly satisfied, that is, for which the energy of 
each correct configuration $y_j$ is smaller than the energies of all the perturbations 
which might be generated by $P(y_j, \cdot)$. This could be made possible by designing 
an energy with a very large number of terms, which will then essentially work 
as an associative memory (like a Hopfield neural net 0), in which the correct 
configurations are stored, but this certainly is not a desirable feature of an energy 
function in image processing. A more efficient goal is to learn some common 
important trends of the correct configurations, and not all their peculiarities, in 
which case having some residual frustration is not a problem. 
4 Illustration 

4.1 Description 

We illustrate this methodology with a binary example. Let $\Omega$ be the set of 
configurations $x = (x_s, s \in S)$, $S = \{1, \ldots, M\}^2$, with $x_s = 0$ or 1 for all s. We define 
an energy U(x) on $\Omega$ as follows. 

Let $U_0(x) = \sum_{s \in S} x_s$. For a radius $r > 0$ and a direction $\alpha \in [0, 2\pi[$, we define 
an energy term $U_{\alpha,r}$ which operates as an edge analyzer in the direction $\alpha$, with 
scale r. 

For $s = (i, j) \in S$, let $B_s(r)$ be the discrete ball of center s and radius r, i.e. the 
set of all $s' = (i', j') \in S$ such that $(i - i')^2 + (j - j')^2 < r^2$. For each direction 
$\alpha$, divide this ball in two parts $B_s^+(r, \alpha)$ and $B_s^-(r, \alpha)$ according to the sign of 
$(i - i') \cos\alpha + (j - j') \sin\alpha$, then define 

^ ^ ^ ^ ^ ^ ^ s' 

s^S {'r,a) s' {r,o') 

Finally, select a series of pairs $(r_i, \alpha_i)$ for $i = 1, \ldots, d$, and set 

$$U(x) = \theta_0 U_0(x) + \sum_{i=1}^{d} \theta_i U_{\alpha_i, r_i}(x)$$

Our experiment will consist in learning the parameters Oq, . . . ,9d on the basis 
of a single image yo, and then try to analyze which features of the image have 
emerged in the final model. Notice that we have added a parameter, 9q, for 
the first term C/g, which is also estimated. If there exist parameters such that 
U{x) > U{yo) for all configurations x which can be generated by P(yo,.), the 
extraneous parameter is redundant (only its sign matters), and this creates a 
direction of recession for the minimized functional. But such a case did not seem 
to happen in the present set of experiments, so that, even with one additional 
parameter, the measure of frustration did remain strictly convex. 

For learning, the perturbations $P(y_0, \cdot)$ consist in adding or deleting balls
of random centers and radii to the configuration $y_0$. To validate the estimated
parameters, we run an energy minimization algorithm (simulated annealing with
exponentially fast decay of temperature) with different starting configurations
(including the learned image $y_0$ itself) to see whether $y_0$ is close to the minimizing
solutions.
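The validation step can be illustrated by a minimal simulated annealing sketch with a geometric (exponentially fast) temperature schedule. The single-site flip proposal, the schedule constants and the function name `simulated_annealing` are assumptions made for this example; the paper does not specify them.

```python
import numpy as np

def simulated_annealing(x0, energy, t0=1.0, decay=0.9, sweeps=60, seed=0):
    """Minimize energy(x) over binary images with single-site Metropolis flips.
    The temperature is multiplied by `decay` after each sweep."""
    rng = np.random.default_rng(seed)
    x = x0.copy()
    u = energy(x)
    t = t0
    for _ in range(sweeps):
        for _ in range(x.size):
            i = rng.integers(x.shape[0])
            j = rng.integers(x.shape[1])
            x[i, j] = 1 - x[i, j]          # propose flipping one pixel
            u_new = energy(x)
            if u_new <= u or rng.random() < np.exp(-(u_new - u) / t):
                u = u_new                  # accept the flip
            else:
                x[i, j] = 1 - x[i, j]      # reject: undo the flip
        t *= decay
    return x, u
```

Recomputing the full energy at every flip is wasteful but keeps the sketch short; a practical implementation would evaluate only the local energy change.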

4.2 Experiments 

We have used three pictures (disc, square and triangle, see fig. 1), and estimated
parameters independently for each picture. The results were quite different for
each image.

The disc picture seems to have been perfectly stored, in the sense of an
associative memory, by the learned parameters: starting with any initial picture, the
final restored picture is a disc, with only minor variations. This is not surprising,
in fact, since the energy function is itself based on disc-shaped analyzers.

The square picture is stabilized by the restoration algorithm, again with
minor variations, so that the estimation has succeeded in making this picture
(almost) a local minimum of the energy. However, starting from other
configurations does not always result in a white square on a dark background, and a
phenomenon reminiscent of phase transition can be observed (see fig. 5). This is
due to the fact that, in the square picture, the number of white pixels is almost
equal to the number of black pixels.

Finally, the triangle picture is not even stabilized by the restoration
algorithm. It is in fact significantly modified, as shown in fig. 6. As stated before,
it would not be difficult to design an energy with additional terms in order to
perfectly store the triangle. It is however more interesting to stay with a given
energy, and analyse which features of the triangle picture have been learned.
This can be seen in fig. 7, where the restored picture from a uniformly white
input clearly has nothing to do with a triangle, but shares essential local features,
in particular regarding the orientations of the boundaries.





Fig. 1. Pictures of disc, square, triangle

Fig. 2. Starting with a white picture with parameters estimated from the disc

Fig. 3. Starting with a black picture with parameters estimated from the disc

Fig. 4. Starting from the disc with parameters estimated from the square

Fig. 5. Starting with a white picture with parameters estimated from the square,
exhibiting a phase-transition-like phenomenon

Fig. 6. Output of the restoration algorithm, initialized with the triangle, and using
parameters estimated from the triangle

Fig. 7. Starting with the white picture with parameters estimated from the triangle

5 Extension to Data-Dependent Cost Functions 

5.1 Generalities 

In a typical use of energy minimization methods for image analysis, one (or
several) of the terms in the energy depends on an extraneous configuration of observed
data $\xi$, like the first term in equation (1). Such situations directly arise from the
Bayesian framework which has been introduced in [3], and applied many times
since then.

In this case, the calibrated parameters should be able to adapt to variations
of the data, and $\theta_1, \ldots, \theta_d$ should be functions of the observed data $\xi$. One simple
way to address this is to model each $\theta_i$ as a linear combination of some fixed
functions of $\xi$, as in regression analysis:

$$\theta_i(\xi) = \sum_{j=1}^{K} \beta_{ij} \Phi_j(\xi)$$

The functions $\Phi_j$ are fixed in the learning procedure. They should be relevant
statistics of the data, for the given application. From a formal point of view, we
are back to the framework of the preceding sections, with the new energy terms
$\Phi_j(\xi) U_i(x)$ and parameters $\beta_{ij}$. However, in this case, it is clear that learning can only
be performed on the basis of a sufficiently large number of correct analyses, of
the kind $(\xi_1, y_1), \ldots, (\xi_n, y_n)$, since we are going to estimate functions of the
variable $\xi$.

An alternative to choosing fixed functions $\Phi_j$ is to set $\Phi_j(\xi) = \Psi(h_j + \langle W_j, \xi \rangle)$,
where $h_j \in \mathbb{R}$ and $W_j$ is a vector of the same dimension as $\xi$, both of which also have
to be estimated. Here $\Psi$ is a fixed function, typically sigmoidal. It is not hard to adapt
the stochastic gradient descent algorithm to deal with this model, which will
have more learning power than the initial linear combinations. The counterpart
of this is that the measure of frustration is not convex anymore.

We now illustrate this approach by considering a simple unidimensional
framework.

5.2 A 1D Example

We consider the issue of smoothing a function $\xi : [0, 1] \to \mathbb{R}$. Fixing a
discretization step $\delta = 1/M$, we let $\xi_k = \xi(k\delta)$ and consider the cost function

$$U(\xi, x) = \sum_{k=1}^{N} (\xi_k - x_k)^2 + \lambda \sum_{k=1}^{N} (x_k - x_{k-1})^2$$

where $x$ is the unknown smooth signal.

To calibrate the parameters, we let $(q_i)$ be regularly spaced
quantiles of the distribution of $(\xi_k - \xi_{k-1})$, and look for $\lambda$ in the form

$$\lambda = \sum_{i} \lambda_i q_i$$

The learning dataset is generated by first simulating the smooth signal $x$ by
random linear combinations of cosine functions on $[0, 1]$:

$$x(t) = \sum_{p=1}^{K} a_p \cos(\omega_p t + \phi_p)$$

where the $a_p$, $\omega_p$ and $\phi_p$ are random; $\xi$ is obtained from $x$ by adding a gaussian
white noise of random variance $\sigma^2$. The random perturbations in the learning
procedure consisted in adding a small variation to one or several $x_k$'s.

The learning procedure achieved the estimation of $\lambda$ as a linear function of
the distribution of the $\xi_k - \xi_{k-1}$. It is an odd function of the quantiles, which
implies that it is not affected if a constant value is added to $\xi_i - \xi_{i-1}$ (i.e. a
linear term added to $\xi_i$). It can be very tightly approximated by the polynomial
$250 q^7 + 3.1 q$, which means that $\lambda$ is close to a linear combination of the 8th
centered moment and the variance of the $\xi_i - \xi_{i-1}$.

The cost function $U$ has been minimized on test data generated
independently, and some results are shown in the corresponding figure.

6 Conclusion

In this paper, we have developed a new learning framework for calibrating
parameters of energy functionals, as used in image analysis. Given a probabilistic way
of building wrong examples from correct ones, we have introduced a stochastic
gradient algorithm which consistently estimates parameters, in order to minimize
a measure of frustration designed to force wrong examples to have a larger energy than
correct ones. An extension of the method to the case of data-dependent
energies has been proposed, resulting in an adaptive set of parameters reacting to
the statistical distribution of the data. The approach has been illustrated by a
preliminary series of experiments.

We are now aiming at developing this approach to deal with realistic
imaging problems. We are, in particular, studying image segmentation energies, and
developing 2D perturbations to learn parameters.



References 

1. R. Azencott, Image analysis and Markov fields, in Proc. of the Int. Conf. on Ind.
and Appl. Math, SIAM, Paris, 1987.

2. A. Benveniste, M. Metivier, and P. Priouret, Algorithmes Adaptatifs et
Approximations Stochastiques, Theorie et Application, Masson, 1987.

3. S. Geman and D. Geman, Stochastic relaxation, Gibbs distributions, and the
Bayesian restoration of images, IEEE Trans. PAMI, 6 (1984), pp. 721-741.

4. J. J. Hopfield, Neural networks and physical systems with emergent collective
computational abilities, Proc. Nat. Acad. Sci. USA, 79 (1982), pp. 2554-2558.
Biophysics.

5. D. Mumford and J. Shah, Optimal approximation by piecewise smooth functions and
variational problems, Comm. Pure and Appl. Math., XLII (1988).




Coupled Geodesic Active Regions 
for Image Segmentation: A Level Set Approach 



Nikos Paragios¹ and Rachid Deriche²

¹ Siemens Corporate Research,
Imaging and Visualization Department,
755 College Road East, Princeton, NJ 08540, USA
E-mail: nikos@scr.siemens.com

² I.N.R.I.A.
B.P. 93, 2004 Route des Lucioles,
06902 Sophia Antipolis Cedex, France
E-mail: der@sophia.inria.fr



Abstract. This paper presents a novel variational method for image seg- 
mentation that unifies boundary and region-based information sources 
under the Geodesic Active Region framework. A statistical analysis based 
on the Minimum Description Length criterion and the Maximum Likeli- 
hood Principle for the observed density function (image histogram) using 
a mixture of Gaussian elements, indicates the number of the different
regions and their intensity properties. Then, the boundary information
is determined using a probabilistic edge detector, while the region in- 
formation is estimated using the Gaussian components of the mixture 
model. The defined objective function is minimized using a gradient- 
descent method where a level set approach is used to implement the 
resulting PDE system. According to the motion equations, the set of 
initial curves is propagated toward the segmentation result under the 
influence of boundary and region-based segmentation forces, and being 
constrained by a regularity force. The changes of topology are natu- 
rally handled thanks to the level set implementation, while a coupled 
multi-phase propagation is adopted that increases the robustness and 
the convergence rate by imposing the idea of mutually exclusive prop- 
agating curves. Finally, to reduce the required computational cost and 
the risk of convergence to local minima, a multi-scale approach is also 
considered. The performance of our method is demonstrated on a variety 
of real images. 



1 Introduction 

The segmentation of a given image is one of the most important techniques for 
image analysis, understanding and interpretation. 

Feature-based image segmentation is performed using two basic image
processing techniques: the boundary-based segmentation (which is often
referred to as edge-based) relies on the generation of a strength image and the
extraction of prominent edges, while the region-based segmentation relies on
the homogeneity of spatially localized features and properties.

Fig. 1. Multi-phase Coupled Geodesic Active Regions for Image Segmentation: the
flow chart.



— Early approaches for boundary-based image segmentation have used
local filtering techniques such as edge detection operators. However, such
approaches have difficulty in establishing the connectivity of edge segments.
This problem has been confronted by employing Snake/Balloons models [6,
12] which also require a good initialization step. Recently, the geodesic active
contour model has been introduced [3, 13] which, combined with the level set
theory [14], deals with the above limitation, resulting in a very elegant and
powerful segmentation tool.

— The region-based methods are more suitable approaches for image
segmentation and can be roughly classified into two categories: the region-growing
techniques [2] and the Markov Random Fields based approaches [9]. The
region growing methods are based on split-and-merge procedures using
statistical homogeneity tests [7,26]. Another powerful region-based tool, which
has been widely investigated for image segmentation, is the Markov
Random Fields (MRF) [10]. In that case the segmentation problem is viewed as
a statistical estimation problem where each pixel is statistically dependent
only on its neighbors so that the complexity of the model is restricted.

— Finally, there is a significant effort to integrate boundary-based with
region-based segmentation approaches [4, 21, 26]. The difficulty lies in
the fact that even though the two modules yield complementary
information, they involve conflicting and incommensurate objectives. The
region-based methods attempt to capitalize on homogeneity properties, whereas
boundary-based ones use the non-homogeneity of the same data as a guide.

In this paper, a unified approach for image segmentation is presented that is
based on the propagation of regular curves [4, 5, 23, 24, 26] and derives from
the Geodesic Active Region model [19,20]. This approach is an extension
of our previous work on supervised texture segmentation [18,20].

This approach is depicted in [fig. (1)] and is composed of two stages. The
first stage refers to a modeling phase where the observed histogram is
approximated using a mixture of Gaussian components. This analysis is based on the
Minimum Description Length criterion and the Maximum Likelihood Principle,
and determines the number of regions as well as their statistics, since a Gaussian
component is associated to each region. Then, the segmentation is performed by
employing the Geodesic Active Region model. The different region boundaries
are determined using a probabilistic module that seeks local discontinuities on
the statistical space that is associated with the image features. This information
is combined with the region one, resulting in a geodesic active region-based
segmentation framework. The defined objective function is minimized with respect
to the different region boundaries (multiple curves) using a gradient descent
method, where the obtained equations are implemented using the level set
theory, which deals automatically with topological changes.
Moreover, as in [25,5,23], a coupling force is introduced to the level set
functions that imposes the constraint of a non-overlapping set of curves. Finally, the
objective function is used within the context of a coarse to fine multi-scale
approach that increases the convergence rate and decreases the risk of converging
to a local minimum.

The remainder of this paper is organized as follows. In section 2 the Geodesic
Active Region model, which is the basis of the proposed approach, is briefly
presented. The problem of determining the number of regions and their intensity
properties is considered in section 3. The proposed segmentation framework is
introduced in section 4, while its implementation issues are addressed in section
5. Finally, conclusions and discussion appear in section 6.

2 Geodesic Active Regions 

The Geodesic Active Region model [15] was originally proposed in [16] to deal
with the problem of supervised texture segmentation and was successfully
exploited in [19] to deal with the motion estimation and tracking problem.

This model will be briefly presented for a simple image segmentation case
with two hypotheses $(h_A, h_B)$ (bi-modal). In order to facilitate the notation, let
us make some definitions:

— Let $I$ be the input frame.

— Let $\mathcal{P}(\mathcal{R}) = \{\mathcal{R}_A, \mathcal{R}_B\}$ be a partition of the frame domain into two
non-overlapping regions $\{\mathcal{R}_A \cap \mathcal{R}_B = \emptyset\}$.

— And, let $\{\partial\mathcal{R}\}$ be the boundaries between $\mathcal{R}_A$ and $\mathcal{R}_B$.

The Geodesic Active Region model assumes that for a given application some
information regarding the real region boundaries and some knowledge about the
desired intensity properties of the different regions are available. For example,
let $[p_C(I(s))]$ be the boundary density function that measures the probability of
a given pixel being at the boundaries between the two regions. Additionally, let
$[p_A(I(s)), p_B(I(s))]$ be the conditional intensity density functions with respect
to the hypotheses $h_A$ and $h_B$.

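Then, the optimization procedure refers to a frame partition problem
[determined by a curve that is attracted by the region boundaries] based on the
observed data, the associated hypotheses and their expected properties. This
partition according to the Geodesic Active Region model is given by:

$$E(\partial\mathcal{R}) = \alpha \int_0^1 \underbrace{g\big(p_C(I(\partial\mathcal{R}(c)))\big)}_{\text{boundary attraction}} \underbrace{|\partial\mathcal{R}'(c)|}_{\text{regularity}}\, dc + (1 - \alpha) \Big[ \iint_{\mathcal{R}_A} g\big(p_A(I(x,y))\big)\, dx\, dy + \iint_{\mathcal{R}_B} g\big(p_B(I(x,y))\big)\, dx\, dy \Big]$$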

where $\partial\mathcal{R}(c) : [0, 1] \to \mathbb{R}^2$ is a parameterization of the region boundaries in a
planar form, $\alpha \in [0, 1]$ is a positive constant balancing the contribution of the two
terms, and $g()$ is a positive monotonically decreasing function (e.g. Gaussian).
The interpretation of the above objective function is clear, since
a curve $[\partial\mathcal{R}]$ is demanded that:

— is regular [regularity], of minimal length and is attracted by the real
boundaries between the regions $\mathcal{R}_A$ and $\mathcal{R}_B$ [eq. (2): boundary attraction]:
Boundary Term,

— and defines a partition of the image that optimizes the segmentation map by
maximizing the a posteriori segmentation probability [20]: Region Term.

The minimization of this function is performed using a gradient descent
method. If $u = (x,y)$ is a point of the initial curve, then the curve should
be deformed at this point using the following equation:

$$\frac{du}{dt} = \Big[ \underbrace{(1 - \alpha)\big[g(p_A(I(u))) - g(p_B(I(u)))\big]}_{\text{region-based force}} + \underbrace{\alpha \big(g(p_C(I(u)))\, \mathcal{K}(u) - \nabla g(p_C(I(u))) \cdot \mathcal{N}(u)\big)}_{\text{boundary-based force}} \Big]\, \mathcal{N}(u)$$



The obtained PDE motion equation has two kinds of forces acting on the curve,
both in the direction of the inward normal.

— Region force

This force aims at shrinking or expanding the curve in the direction that
maximizes the a posteriori segmentation probability according to the
observation set and the expected intensity properties of the different regions.

— Boundary force

This force aims at shrinking the curve towards the boundaries between the
different regions, being constrained by the curvature effect.



3 Regions and their Statistics 

In order to simplify the notation and to introduce the proposed
model more easily, let us make some definitions:

— Let $H(I)$ be the observed density function (histogram) of the input image,

— Let $\mathcal{P}(\mathcal{R}) = \{\mathcal{R}_i : i \in [1, N]\}$ be a partition of the image into $N$
non-overlapping regions, and let $\partial\mathcal{P}(\mathcal{R}) = \{\partial\mathcal{R}_i : i \in [1, N]\}$ be the region
boundaries,

— And, let $h_i$ be the segmentation hypothesis that is associated with the region
$\mathcal{R}_i$.



The key hypothesis that is made to perform segmentation relies on the fact
that the image is composed of homogeneous regions. In other words, we assume
that the intensity properties of a given region (local histogram) can be
determined using a Gaussian distribution and hence the global intensity properties
of the image (image histogram) refer to a mixture of Gaussian elements.

Let $p(.)$ be the probability density function with respect to the intensity
space of the image $I$ (normalized image histogram $H(I)$). If we assume that this
probability density function is homogeneous, then an intensity value $x$ is derived
by selecting a component $k$ with a priori probability $P_k$ and then selecting this
value according to the distribution of this element $p_k()$. This hypothesis leads
to a mixture model of Gaussian elements

$$p(x) = \sum_{k=1}^{N} P_k\, p_k(x), \qquad p_k(x) = \frac{1}{\sqrt{2\pi\sigma_k^2}} \exp\Big(-\frac{(x - \mu_k)^2}{2\sigma_k^2}\Big)$$
This mixture model consists of a vector $\Theta$ with $3N - 1$ unknown parameters
$\Theta = \{(P_k, \mu_k, \sigma_k) : k \in [1, \ldots, N]\}$: (i) the number of components [$N$], (ii) the
a priori probability of each component [$P_k$], (iii) and, the mean [$\mu_k$] and the
standard deviation [$\sigma_k$] of each component.

Hence, there are two key problems to be dealt with: the determination of
the components number and the estimation of the unknown parameters $\Theta$ of
these components. These problems are solved simultaneously using the Minimum
Description Length (MDL) criterion [22] and the Maximum Likelihood Principle
(ML) [8]. Thus, given the data sample and all possible approximations using
Gaussian Mixture models, the MDL principle is to select the approximation
which minimizes the length of the mixture model as well as the approximation
error using this model. In other words, with more complex mixture models, the
approximation is better and the error is minimized, but at the same time the
cost induced by the model is significant since more parameters are required for
its description. Thus, a compromise between the components number and the
approximation error has to be obtained.

Fig. 2. (a) Input Image, (b) Image Histogram and its approximation: Components
Number: 4, Mean Approximation Error: 1.0^6^1e-05, Iterations Number: 117, (c)
Region Intensity Properties [Component 1: black pants, Component 2: background,
Component 3: (hair, t-shirt), Component 4: skin].

This is done using the MDL principle, where initially a single mode Gaussian
mixture is assumed. Then, the number of mixture modes is increased and an
estimation of the mixture parameters is performed. These parameters are used
to determine the MDL measurement for the current approximation. If the
obtained measurement is smaller than the one given by the approximation with
a smaller number of components, then the number of components is increased.
Finally, the approximation that gives the minimum value for the MDL measurement
is selected. The performance of this criterion is demonstrated in [fig. (2.b)].
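The selection loop described above can be sketched as follows. This is an illustration, not the authors' implementation: the parameters are estimated with a plain EM iteration on the raw intensity samples, and the MDL measurement is assumed to take the common two-part form (negative log-likelihood plus half the number of free parameters, $3k - 1$, times the logarithm of the sample size). The names `fit_gmm_1d` and `select_by_mdl` are hypothetical.

```python
import numpy as np

def fit_gmm_1d(x, k, n_iter=200, tol=1e-8):
    """Maximum likelihood fit of a k-component 1D Gaussian mixture by EM.
    Returns weights, means, stds and the last log-likelihood evaluated."""
    x = np.asarray(x, dtype=float)
    n = x.size
    means = np.quantile(x, (np.arange(k) + 0.5) / k)   # deterministic start
    stds = np.full(k, x.std() / k + 1e-6)
    weights = np.full(k, 1.0 / k)
    prev = -np.inf
    loglik = prev
    for _ in range(n_iter):
        # E-step: log of P_k * p_k(x) for every sample and component
        log_comp = (np.log(weights) - 0.5 * np.log(2 * np.pi * stds ** 2)
                    - 0.5 * ((x[:, None] - means) / stds) ** 2)
        log_norm = np.logaddexp.reduce(log_comp, axis=1)
        resp = np.exp(log_comp - log_norm[:, None])
        loglik = log_norm.sum()
        # M-step: re-estimate priors, means and standard deviations
        nk = resp.sum(axis=0) + 1e-12
        weights = nk / n
        means = (resp * x[:, None]).sum(axis=0) / nk
        var = (resp * (x[:, None] - means) ** 2).sum(axis=0) / nk
        stds = np.sqrt(var + 1e-6)
        if abs(loglik - prev) < tol * abs(loglik):
            break
        prev = loglik
    return weights, means, stds, loglik

def select_by_mdl(x, k_max=8):
    """Increase the number of components while the (assumed) MDL value decreases."""
    best = None
    for k in range(1, k_max + 1):
        w, m, s, ll = fit_gmm_1d(x, k)
        mdl = -ll + 0.5 * (3 * k - 1) * np.log(len(x))
        if best is not None and mdl >= best[0]:
            break
        best = (mdl, k, w, m, s)
    return best
```

Stopping at the first increase of the measurement follows the description in the text; scanning all values up to `k_max` and keeping the global minimum would be a more cautious variant.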

4 Image Segmentation 

Given the region number as well as their expected intensity properties, we can
proceed to the segmentation phase. Two different modules are involved, a
boundary-based and a region-based one.



4.1 Determining the Boundary Information 

The first objective is to extract some information regarding the real boundaries of
each region. This can be done by employing an edge detector, that is, by seeking
high gradient values on the input image. Given the hypothesis that this image
is composed of homogeneous regions, this method will provide reliable global
boundary information. However, this information is blind, since its nature cannot
be determined. In other words, a pixel with an important gradient value (boundary
pixel) cannot be attributed to the boundaries of a specific region $[\partial\mathcal{R}_i]$.

Here, an alternative method is proposed to determine the boundary-based
information [17]. Let $s$ be a pixel of the image, $N(s)$ a partition of its local
neighborhood, and let $N_L(s)$ and $N_R(s)$ be the regions associated with this partition.
Moreover, let $[p(B_k | I(N(s)))]$ be the boundary probability density function with
respect to the $k$ hypothesis, $[p(I(N(s)) | B_k)]$ be the conditional boundary
probability and $[p(I(N(s)) | \bar{B}_k)]$ be the conditional non-boundary probability. Then,
using the Bayes rule and making some assumptions regarding the global a priori
boundary probability [17], it can be easily shown that the probability for a pixel
$s$ being at the boundaries of the $k$ region, given a neighborhood partition $N(s)$, is
given by

$$p(B_k | I(N(s))) = \frac{p(I(N(s)) | B_k)}{p(I(N(s)) | B_k) + p(I(N(s)) | \bar{B}_k)}$$

The conditional boundary/non-boundary probabilities can be estimated
directly from known quantities (see [17] for details). Thus,

k Boundary Condition:

If $s$ is a boundary pixel, then there is a partition $[N_L(s), N_R(s)]$ where the
most probable assignment for the “left” local region is $k$ and for the “right”
$j$ [$j \neq k$], or vice-versa,

k Non-Boundary Condition:

On the other hand, if $s$ is not a $k$ boundary pixel, then for every possible
neighborhood partition the most probable assignment for the “left” as well
as for the “right” local region is $k$, or $i$ and $j$ where $\{i, j\} \neq k$.

As a consequence, the conditional $k$ boundary/non-boundary probability
density functions are given by,

$$p(I(N(s)) | B_k) = p_k(I(N_R(s)))\, p_j(I(N_L(s))) + p_j(I(N_R(s)))\, p_k(I(N_L(s)))$$

$$p(I(N(s)) | \bar{B}_k) = p_k(I(N_R(s)))\, p_k(I(N_L(s))) + p_i(I(N_R(s)))\, p_j(I(N_L(s)))$$

where $\{i, j\}$ can be identical and

- $p_k(I(N_R(s)))$ is the probability of the “right” local region $[N_R(s)]$ being at the
$k$ region, given the observed intensity values within this region $[I(N_R(s))]$,

- $p_j(I(N_L(s)))$ is the probability of the “left” local region $[N_L(s)]$ being at the $j$
region, given the observed intensity values within this region $[I(N_L(s))]$.

Given the definition of the probability for a pixel $s$ being a $k$ boundary point,
the next problem is to define the neighborhood partition. We consider four
different partitions of the neighborhood, and the local neighborhood regions are
considered to be 3 × 3 directional windows. We estimate the boundary
probability for all partitions by using the mean values over these windows, and set the
boundary information $[p_{B,k}(s)]$ for the given pixel $s$ with respect to the hypothesis $k$
using the partition with the maximum boundary probability. The same procedure
is followed for all regions, given their intensity properties (Gaussian
component), resulting in $N$ boundary-based information images $[p_{B,k}(s) : k \in [1, N]]$.

A demonstration of the extracted boundary information using this framework
can be found in [fig. (3)].
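For one neighborhood partition, the boundary probability of this section can be sketched as below. Several choices are assumptions of this example rather than statements from the paper (whose details are in [17]): each local window is summarized by its mean intensity, the component likelihoods are evaluated at that mean, and the competing hypotheses $i$ and $j$ are taken to be the most favourable components other than $k$. The function names are hypothetical.

```python
import numpy as np

def gaussian_pdf(v, mus, sigmas):
    """Density of every Gaussian component evaluated at the value v."""
    return np.exp(-0.5 * ((v - mus) / sigmas) ** 2) / (np.sqrt(2 * np.pi) * sigmas)

def boundary_probability(mean_left, mean_right, k, mus, sigmas):
    """p(B_k | I(N(s))) for one partition, from the mean intensities of the
    'left' and 'right' windows (requires at least two components)."""
    p_left = gaussian_pdf(mean_left, mus, sigmas)
    p_right = gaussian_pdf(mean_right, mus, sigmas)
    others = [m for m in range(len(mus)) if m != k]
    # k on one side, some j != k on the other side (either order)
    boundary = max(p_right[k] * p_left[j] + p_right[j] * p_left[k] for j in others)
    # k on both sides, or two hypotheses different from k
    non_boundary = (p_right[k] * p_left[k]
                    + max(p_right[i] for i in others) * max(p_left[j] for j in others))
    total = boundary + non_boundary
    return boundary / total if total > 0 else 0.0
```

The boundary information for a pixel would then be the maximum of this value over the four partitions of its neighborhood.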

4.2 Setting the Energy

The proposed method has implicitly made the assumption that the image is
composed of $N$ regions and a given pixel $s$ lies always between two regions
$[\mathcal{R}_i, \mathcal{R}_{k_i}]$. However, given the initial curves [regions] positions, some image pixels
might not belong to any region. Moreover, other image pixels might be attributed
to several regions.

To deal with this problem, a temporal spending region $\mathcal{R}_0$ has to be
considered. This region (i) does not correspond to a real hypothesis (it is
composed of pixels with different hypotheses origins), (ii) does not have
a predefined intensity character (it depends on the latest segmentation
map) and (iii) has to be empty when convergence is reached. The next
problem is to define the intensity properties of this region, thus the probability
density function $p_0()$. This can be done by seeking the non-attributed image
pixels and estimating directly from the observed intensity values the probability
density function $p_0()$.

Then, the segmentation task can be considered within the geodesic active
region framework where the region information is expressed directly from the
Gaussian elements of the mixture model $[p_i()]$ estimated in the observed
image $[p_i(I(s))]$. Thus, the proposed framework consists of minimizing the following
objective function:

$$E(\mathcal{P}(\mathcal{R})) = \alpha \sum_{i=0}^{N} \underbrace{\iint_{\mathcal{R}_i} g\big(p_i(I(x,y)), \sigma_R\big)\, dx\, dy}_{\text{region fitting}} + (1 - \alpha) \sum_{i=1}^{N} \int_0^1 \underbrace{g\big(p_{B,i}(\partial\mathcal{R}_i(c_i)), \sigma_B\big)}_{\text{boundary attraction}}\, \underbrace{|\partial\mathcal{R}_i'(c_i)|}_{\text{regularity constraint}}\, dc_i$$

where $\partial\mathcal{R}_i(c_i)$ is a parameterization of the region $\mathcal{R}_i$ boundaries into a planar
form, and $g(x, \sigma)$ is a Gaussian function.

Within this framework the set of the unknown variables consists of the
different region boundaries (curves) $[\partial\mathcal{R}_i]$. The interpretation of the defined objective
function is the same as the one presented in section 2 for the bi-modal Geodesic
Active Region framework.



4.3 Minimizing the Energy

The defined objective function is minimized using a gradient descent method.
Thus, the system of the Euler-Lagrange motion equations with respect to the
different curves (one for each region) is given by:

$$\forall i \in [1, N], \quad \frac{\partial}{\partial t}\partial\mathcal{R}_i = \underbrace{\alpha \big[g(p_i(I(\partial\mathcal{R}_i)), \sigma_R) - g(p_{k_i}(I(\partial\mathcal{R}_i)), \sigma_R)\big]\, \mathcal{N}_i(\partial\mathcal{R}_i)}_{\text{Region-based force}} + \underbrace{(1 - \alpha) \big(g(p_{B,i}(\partial\mathcal{R}_i), \sigma_B)\, \mathcal{K}_i(\partial\mathcal{R}_i) + \nabla g(p_{B,i}(\partial\mathcal{R}_i), \sigma_B) \cdot \mathcal{N}_i(\partial\mathcal{R}_i)\big)\, \mathcal{N}_i(\partial\mathcal{R}_i)}_{\text{Boundary-based force}}$$

where $\mathcal{K}_i$ (resp. $\mathcal{N}_i$) is the Euclidean curvature (resp. normal) with respect to
the curve $\partial\mathcal{R}_i$.

Fig. 3. Boundary information with respect to the different regions for the woman image
[fig. (2.a)], (a) Region 1 (black pants), (b) Region 2 (background), (c) Region 3 (hair,
t-shirt), (d) Region 4 (skin).

Moreover, the assumption that the pixel $\partial\mathcal{R}_i$ lies between the regions
$\mathcal{R}_i$ and $\mathcal{R}_{k_i}$ was made implicitly to derive the above motion equations, and
the probability $p_{k_i}()$ is given by,

$$p_{k_i}(I(s)) = \begin{cases} p_0(I(s)), & \text{if } s \notin \bigcup_{m \neq i} \mathcal{R}_m \\ \max\{p_m(I(s)) : m \in [1, N],\ m \neq i,\ s \in \mathcal{R}_m\}, & \text{otherwise} \end{cases}$$

Thus, if the given pixel is not attributed to any region, then the
spending region distribution $p_0(.)$ is used to determine the $k_i$
hypothesis. On the other hand, if this pixel is already attributed to one, or
more than one, region, then the most probable hypothesis is used.

These motion equations have the same interpretation as the one presented
in section 2. Moreover, they refer to a multi-phase curve propagation since
several curves are propagated simultaneously. In other words, each region
is associated with a motion equation and the propagation of a single or
multi-component initial curve. However, within this system of motion equations there
is no interaction between the propagations of the different curves.

4.4 Level Set Implementation 

The obtained motion equations are implemented using the pioneering work of
Osher and Sethian [14], the level set theory, where the central idea is to represent
the moving front $\partial\mathcal{R}(c,t)$ as the zero-level set $\{\phi(\partial\mathcal{R}(c,t), t) = 0\}$ of a function
$\phi$. This representation of $\partial\mathcal{R}(c,t)$ is implicit, parameter-free and intrinsic.
Additionally, it is topology-free since different topologies of the zero level-set do
not imply different topologies of $\phi$. It is easy to show that if the moving front
evolves according to $[\frac{\partial}{\partial t}\partial\mathcal{R}(c,t) = F(\partial\mathcal{R}(c,t))\, \mathcal{N}]$ for a given function $F$, then
the embedding function deforms according to $[\frac{\partial}{\partial t}\phi(p) = F(p)\, |\nabla\phi(p)|]$. For
this level-set representation, it is proved that the solution is independent of the
embedding function $\phi$, which in most cases is initialized as a signed distance
function.
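A minimal numerical sketch of the evolution $\frac{\partial}{\partial t}\phi = F\,|\nabla\phi|$ is given below. It assumes an explicit Euler step and central differences (`np.gradient`) for the gradient norm, which is enough for illustration; practical level set codes normally rely on upwind schemes and periodic re-initialization. The helper names are hypothetical, and $\phi$ is taken negative inside the region, as in the text.

```python
import numpy as np

def signed_distance_circle(shape, center, radius):
    """Signed distance to a circle: negative inside, positive outside."""
    ii, jj = np.indices(shape)
    return np.sqrt((ii - center[0]) ** 2 + (jj - center[1]) ** 2) - radius

def evolve(phi, speed, dt=0.5, steps=1):
    """Explicit Euler integration of d(phi)/dt = speed * |grad phi|.
    `speed` may be a scalar or an array with the shape of phi."""
    phi = phi.astype(float).copy()
    for _ in range(steps):
        gi, gj = np.gradient(phi)
        phi += dt * speed * np.sqrt(gi ** 2 + gj ** 2)
    return phi
```

With this sign convention a positive speed raises $\phi$ and shrinks the region $\{\phi < 0\}$, while a negative speed expands it; the curve itself is never parameterized, so splits and merges need no special handling.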

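Thus, the system of motion equations that drives the multi-phase curve
propagation for segmentation is transformed into a system of multiple surfaces
evolution given by,

$$\forall i \in [1, N], \quad \frac{\partial}{\partial t}\phi_i(s) = \alpha \big(g(p_i(I(s)), \sigma_R) - g(p_{k_i}(I(s)), \sigma_R)\big)\, |\nabla\phi_i(s)| + (1 - \alpha) \big(g(p_{B,i}(s), \sigma_B)\, \mathcal{K}_i(s)\, |\nabla\phi_i(s)| + \nabla g(p_{B,i}(s), \sigma_B) \cdot \nabla\phi_i(s)\big)$$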



4.5 Coupling the Level Sets 

The use of the level set methods provides a very elegant tool to propagate curves
where their position is recovered by seeking the zero level set crossing points.
Moreover, the state of a given pixel with respect to a region hypothesis can be
easily determined since if it belongs to the region, then the corresponding level
set value is negative. On the other hand, if it does not belong to it, then the
corresponding value is positive. Additionally, since we consider signed distance
functions for the level set implementation, a step further can be taken by
estimating the distance of the given pixel from each curve. This information is very
valuable during the multi-phase curve propagation cases where the overlapping
between the different curves is prohibited.

However, the overlapping between the different curves is almost an inevitable
situation, at least during the initialization step. Moreover, the case where an
image pixel has not been attributed to any hypothesis may occur. Let us now
assume that a pixel is attributed initially to two different regions (there are two
level set functions with negative values at it). Then, as in [25, 5, 23], a constraint
that discourages a situation of this nature can be easily introduced, by adding
an artificial force (always in the normal direction) to the corresponding
level set motion equations that penalizes pixels with multiple labels (they are
attributed to multiple regions). Moreover, a similar force can be introduced to
discourage situations where pixels are not attributed to any regions. This can
be done by modifying the level set motion equations as,



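$$\forall i \in [1, N], \quad \frac{\partial}{\partial t}\phi_i(s) = \underbrace{\beta \sum_{m \in [1, N]} H_i(m, \phi_m(s))\, |\nabla\phi_i(s)|}_{\text{Coupling force}} + \underbrace{\gamma \big[g(p_i(I(s)), \sigma_R) - g(p_{k_i}(I(s)), \sigma_R)\big]\, |\nabla\phi_i(s)|}_{\text{Region force}} + \underbrace{\delta \big(g(p_{B,i}(s), \sigma_B)\, \mathcal{K}_i(s)\, |\nabla\phi_i(s)| + \nabla g(p_{B,i}(s), \sigma_B) \cdot \nabla\phi_i(s)\big)}_{\text{Boundary force}}$$

where $\beta, \gamma, \delta$ are positive constants [$\beta + \gamma + \delta = 1$], and the function $H_i(\cdot, \phi(\cdot))$ is
given by

$$H_i(m, \phi_m(s)) = \begin{cases} 0, & \text{if } m = i \\ +1, & \text{if } m \neq i \text{ and } \phi_m(s) < 0 \\ -1, & \text{otherwise} \end{cases}$$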
Let us now interpret the new artificial force that has been added to the
motion equation $i$, for a given pixel $s$:

Expanding Effect:

If this pixel does not belong to any region, then the new force is negative,
equal to $f_c = -(N - 1)|\nabla\phi_i|$ and aims at expanding the region $\mathcal{R}_i$ to occupy
this pixel (the appearance of non-attributed pixels is discouraged).

Shrinking Effect:

On the other hand, if this pixel has been already attributed to another region
$[\mathcal{R}_k]$, then the level set $[\phi_k]$ will contribute with a positive force that aims
at shrinking the region $\mathcal{R}_i$ (the overlapping is discouraged).
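The two effects can be checked on a toy case with the following sketch. It assumes, in line with the interpretation above, that every level set other than $\phi_i$ contributes $+1$ where it is negative (pixel inside that region) and $-1$ elsewhere; the factor $|\nabla\phi_i|$ is omitted, and the name `coupling_discrete` is hypothetical.

```python
import numpy as np

def coupling_discrete(phis, i):
    """Sum over m != i of +1 where phi_m < 0 (pixel already inside region m)
    and -1 elsewhere; to be multiplied by |grad phi_i| in the motion equation."""
    total = np.zeros_like(phis[i], dtype=float)
    for m, phi in enumerate(phis):
        if m != i:
            total += np.where(phi < 0, 1.0, -1.0)
    return total
```

With three level sets, a pixel outside every region receives $-(N - 1) = -2$ (expanding effect), whereas each region already containing the pixel adds a positive contribution (shrinking effect).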

Although the selection of the function [$H_i(\cdot, \phi(\cdot))$] seems to
fulfill the required conditions (mutually exclusive propagating curves, no
overlapping, no “empty” pixels), it encounters some problems. First, the
non-attributed pixels are penalized in the same manner as the ones that have
been attributed to multiple regions. Second, the defined coupling function is
discontinuous, which is an undesirable property since it creates stability
problems during the level set evolution.

To summarize, the coupling function has to be redefined by taking into account
the following considerations:

i. A pixel that is already attributed to a region j and is far away from
$\partial\mathcal{R}_j$ should strongly discourage the evolution of the level
set [$\phi_i$] to include this pixel in $\mathcal{R}_i$,

ii. A pixel which belongs to the region $\mathcal{R}_j$ and is close to its
boundaries can be reached or be liberated by $\partial\mathcal{R}_j$ during the
next few iterations, and hence, the coupling force introduced by the j level
set function should “tolerate” a temporary overlap.

Thus, inspired by the properties of the trigonometric functions, the coupling
force is defined as.
Fig. 4. The trigonometric basis of the level set coupling function.

where the basis function $H_a(x)$ is shown in [fig. (4)] and is given by:

$$H_a(x) = \begin{cases} +1, & \text{if } x > a \\ -1, & \text{if } x < -a \\ \frac{1}{\tan(1)}\tan(x/a), & \text{if } |x| \le a \end{cases}$$



In any case, the selection of this function is still an open issue and for the time 
being we are investigating other forms for it. 
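
The saturating behaviour of the basis function can be checked numerically. The
following is a minimal Python sketch, not the authors' implementation: the name
`coupling_basis` is hypothetical, and the normalisation $1/\tan(1)$ of the
inner branch is an assumption, chosen so that the function is continuous at
$x = \pm a$.

```python
import math


def coupling_basis(x: float, a: float) -> float:
    """Trigonometric basis H_a(x): saturates at +/-1 outside [-a, a].

    Assumption: the inner branch is tan(x / a) / tan(1), which makes the
    function continuous at x = a and x = -a.
    """
    if a <= 0:
        raise ValueError("a must be positive")
    if x > a:
        return 1.0
    if x < -a:
        return -1.0
    return math.tan(x / a) / math.tan(1.0)


# Signed distances to a boundary, with a band size a = 3 (illustrative value).
samples = [coupling_basis(d, 3.0) for d in (-6.0, -3.0, -1.0, 0.0, 1.0, 3.0, 6.0)]
```

Pixels deep inside or far outside another region (|x| > a) receive the full
penalty, while pixels within the band receive a graded one, which is the
"tolerance" asked for in consideration ii.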

To interpret this force via the new function, a level set function [$\phi_i$]
and a pixel location [s] are considered,

i. If s is already attributed to another region, then there is a hypothesis j
for which $\phi_j(s) < 0$, which will contribute a positive value (shrinking
effect) to the coupling force that is proportional to the distance of this
pixel from the boundaries of $\mathcal{R}_j$,

ii. A similar interpretation holds if this pixel is not attributed to any
region (expanding effect). However, in this case the coupling force has to be
normalized, because it is not appropriate to penalize overlapping in the same
way as the case in which the given pixel is not attributed to any of the
regions. At the same time, this force is plausible if and only if this pixel is
not attributed to any region.






5 Implementation Issues 

Analyzing the obtained motion equations reveals some hidden problems, due to
the fact that the region forces are estimated using a single intensity-based
probability value. For real image segmentation cases, however, there is always
an overlap between the Gaussian components that characterize the different
regions. Furthermore, due to the presence of noise, isolated intensity values
incoherent with the region properties can be found within a region. As a
consequence, it is quite difficult to categorize a pixel based on its very
local data (single intensity value).

To cope with these problems, a circular window approach can be used, as
proposed in [26]. Hence, a centralized window is defined locally and the
region-based force is estimated as the mean value of the region-based forces of
the window pixels [fig. (5, 7, 8)]. However, in contrast to [26], where all the
window pixels were considered equally, here the distance between the window
pixels and the window center is used, and these pixels contribute to the region
force with weights inversely proportional to their distances.
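
The windowed estimate described above can be sketched as follows. This is an
illustrative Python sketch under stated assumptions, not the authors' code: the
function name `weighted_window_force` is hypothetical, and the weight
`1 / (1 + d)` is an assumed form of "inversely proportional to the distance"
that stays finite at the window center (d = 0).

```python
import math


def weighted_window_force(force, row, col, radius):
    """Distance-weighted mean of per-pixel region forces in a circular window.

    `force` is a 2D list of per-pixel region forces. The weight 1 / (1 + d)
    is an assumption made here to avoid dividing by zero at the center.
    """
    rows, cols = len(force), len(force[0])
    total, weight_sum = 0.0, 0.0
    for r in range(max(0, row - radius), min(rows, row + radius + 1)):
        for c in range(max(0, col - radius), min(cols, col + radius + 1)):
            d = math.hypot(r - row, c - col)
            if d > radius:  # keep the window circular
                continue
            w = 1.0 / (1.0 + d)
            total += w * force[r][c]
            weight_sum += w
    return total / weight_sum


# An isolated outlier at the center is damped by its neighbours.
noisy = [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0]]
smoothed = weighted_window_force(noisy, 1, 1, 1)
```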

A more elegant solution can be obtained by considering a multi-scale approach.
It is well known that multi-scale techniques significantly reduce the
computational cost of the minimization process and smooth the objective
function, which reduces the risk of converging to local minima. The main idea
consists in defining a consistent coarse-to-fine multi-grid contour propagation
by using contours which are constrained to be piecewise constant over smaller
and smaller pixel subsets [11]. The objective function which is considered at
each level is then automatically derived from the original finest scale energy
function. Additionally, the finest data space is used at each level, and there
is no need to construct a multi-resolution pyramid of the data. More details
about the multi-scale implementation of the proposed segmentation framework can
be found in [17, 15].

Fig. 5. Segmentation for the woman image [fig. (2.a)]. Multi-phase Curve
Propagation. A random initialization step is used with a large number of
spoiled regions. The initial regions are the same for all hypotheses.
(1) Region 1 (black pants), (2) Region 2 (skin), (3) Region 3 (background),
(4) Region 4 (hair, t-shirt).

As for the selection of the model parameters, we have observed that in most
cases the region force is more reliable since it is estimated over blocks. On
the other hand, the boundary force ensures the regularity of the propagating
curves. Finally, the coupling force is given less weight, in a progressive way,
since it has been introduced artificially and has a complementary role. Taking
these remarks into account, the following settings are used
[$\beta \approx 0.20$, $\gamma \approx 0.35$, $\delta \approx 0.45$].

Finally, the parameter a of the coupling function is determined using the band
size of the Narrow Band algorithm [1], which is used to implement the evolution
of the level set functions.

To summarize, the proposed approach:

— Initially, determines the number of regions and their intensity properties,

— Then, estimates the boundary probabilities with respect to the different
hypotheses,

— Finally, performs segmentation by the propagation of a “mutually exclusive”
set of regular curves under the influence of boundary and region-based
segmentation forces.

Fig. 6. (a) Input Image, (b) Image Histogram and its approximation: Components
Number: 5, Mean Approximation Error: 2.283931e-06, Iterations Number: (*^)
Region Intensity Properties.

6 Discussion, Summary 

In this paper¹, a new multi-phase level set approach for unsupervised image
segmentation has been proposed. Very promising experimental results were
obtained using real images [fig. (5,7,8)] of different nature (outdoor,
medical, etc.).

As far as the computational cost of the proposed method is concerned (an
ULTRA-10 Sun Station with 256 MB RAM and a 299 MHz processor was used), we can
make the following remarks: the modeling phase is very fast, almost real time.
On the other hand, the segmentation phase is very expensive. The extraction of
the boundary information takes approximately 3 to 5 seconds for a 256 × 256
image with four different regions, while the propagation phase is more
expensive because multiple level set evolutions run in parallel. Thus, for a
256 × 256 image (Coronal image [fig. (8)]) with a random initialization step,
the propagation phase takes approximately 20 seconds. However, this cost is
strongly related to the number of regions, the initial curve positions and the
parameters of the level set evolution. This cost is significantly decreased by
the use of the multi-scale approach (three to five times).

Summarizing, in this paper a new variational framework has been proposed 
to deal with the problem of image segmentation. The main contributions of the 
proposed image segmentation model are the following: 

— An adaptive method that automatically determines the number of regions and
their intensity properties,

¹ A detailed version of this article can be found at [17], while more
experimental results (in MPEG format) are available at:
http://www-sop.inria.fr/robotvis/personnel/nparagio/demos
Fig. 7. The segmentation of the house image into five regions. Curve
propagation: left to right. (1) House walls, (2) Sky, (3) Ground, (4) Windows,
(5) Small trees, flowers, shadows.



— A variational image segmentation framework that integrates boundary and
region-based segmentation modules and connects the optimization procedure
with the curve propagation theory,

— The implementation of this framework within level set techniques, resulting
in a segmentation paradigm that can deal automatically with changes of
topology and is free from the initial conditions,

— The interaction between the propagation of the different curves [regions]
using an artificial coupling force that imposes the concept of mutually
exclusive propagating curves, increases the convergence rate, and eliminates
the risk of convergence to a non-proper solution,

— And, the consideration of the proposed model in a multi-scale framework,
which deals with the presence of noise, increases the convergence rate, and
decreases the risk of convergence to a local minimum.



As for the future directions of this work, incorporating into the model a term
that accounts for some prior knowledge with respect to the expected
segmentation map is a challenge (constrained geodesic active regions).

Fig. 8. Segmentation in five regions of a Coronal Medical image.
Abstract. We propose a variational framework for determining global
minimizers of rough energy functionals used in image segmentation.
Segmentation is achieved by minimizing an energy model, which is comprised
of two parts: the first part is the interaction between the observed data
and the model, the second is a regularity term. The optimal boundaries are
the set of curves that globally minimize the energy functional. Our
motivation comes from the observation that energy functionals are
traditionally complex, so that it is usually difficult to specify the global
minimizers corresponding to “best” segmentations. Therefore, we focus on
basic energy models whose global minimizers can be explicitly determined.
In this paper, we prove that the set of curves that minimizes the image
moment-based energy functionals is a family of level lines, i.e. the
boundaries of level sets (connected components) of the image. For the
completeness of the paper, we present a non-iterative algorithm for
computing partitions with connected components. It leads to a sound
initialization-free algorithm without any hidden parameter to be tuned.



1 Introduction 

One of the primary goals of early vision is to segment the domain of an image
into regions ideally corresponding to distinct physical objects in the scene.
While it has been clear that image segmentation is a critical problem, it has
proven difficult to specify segmentation criteria that capture non-local
properties of an image and to develop efficient algorithms for computing
segmentations. There is a wide range of image segmentation techniques in the
literature. Many of them rely on the design and minimization of an energy
function which captures the interaction between models and image data.
Conventional segmentation techniques generally fall into two distinct classes,
being either boundary-based or region-based. The former class looks at the
image discontinuities near object boundaries, while the latter examines the
homogeneity of spatially localized features inside object boundaries. Based on
these properties, each of these has characteristic advantages and drawbacks.
Nevertheless, several methods combine both approaches.

Region-based approaches are our main interest. In contrast to boundary-based
methods, region-based approaches try to find partitions of the image pixels
into zones that are as homogeneous as possible with respect to coherent image
properties such as brightness, color and texture. Homogeneity is traditionally
measured by a given global objective function and hard decisions are made only
when information from the whole image is examined at the same time. In that
case, the boundaries are the set of curves that minimizes a global energy
function. Past approaches have centered on formulating the problem as the
minimization of a functional involving the image intensity and edge functions.
Some energy models are based on a discrete model of the image, such as Markov
random fields or a Minimum Description Length (MDL) representation, whereas
variational models are based on a continuous model of the image. More recently,
Zhu attempted to unify snakes, region growing and energy/Bayes/MDL within a
general framework [28]. Finally, Blake and Zisserman [2] and Mumford and Shah
have written about most aspects of this approach to segmentation and have
proposed various complex functionals whose minima correspond to segmented
images. In a recent review, Morel and Solimini have, indeed, shown that most
approaches aim at optimizing a cost functional which is the combination of
three terms: one which ensures that the smoothed image approximates the
observed one, another which states that the gradient of the smoothed image
should be small, except on a discontinuity set, and a last one which ensures
that the discontinuity set has a small length. In other respects, while these
different approaches offer powerful theoretical frameworks and minimizers
exist, it is often computationally difficult to minimize the associated
functions. Typically, some embedding procedure, like graduated non-convexity
[2], is used to avoid bad local minima of cost functionals. A fairly complete
analysis is available only for a simplified version of the Mumford and Shah
model that approximates a given image with piecewise constant functions.
Moreover, in the area of region-based approaches, layers approaches attempted
to use both region and boundary information. But the number of layers and the
values associated with the layers must be known a priori or estimated using
ad-hoc methods or prohibitive Expectation-Maximization procedures.



The main challenge for energy-model-based approaches is to find more effective
and faster ways of estimating the boundaries and values for regions minimizing
the energy than those presently available. This motivates the search for global
minimizers of energy functionals commonly used in image segmentation. The key
contribution of this paper is to provide basic energy models whose global
minimizers can be explicitly determined in advance. Accordingly, energy
minimization methods and iterative algorithms are not necessary to solve the
optimization problem. The energy model introduced in a discrete setting by
Beaulieu and Goldberg and reviewed by Morel and Solimini has been the starting
point for our own work. This model tends to obtain a partition with a small
number of regions and small variances without a priori knowledge on the image.
The cost function allows the image to be partitioned into regions, though in a
more restrictive manner than previous approaches since it can generate
irregular boundaries. In prior work, the energy is efficiently minimized using
a split-and-merge algorithm. Here, our approach to determining global
minimizers of similar energy models is completely different. The present
investigation is based on a variational model. In Section 2, we prove that the
set of curves that minimizes a particular class of energy models is a family of
level lines defined from level sets of the image. We list some prior models
(Markov connected component fields, entropy prior) which are consistent with
this theoretical framework. In this sense, the method is deterministic and
equivalent to a procedure that selects the “best” level lines delimiting object
boundaries. The rest of the paper is organized as follows. A description of the
initialization-free segmentation algorithm is included in Section 3. In
Section 4, experiments on several examples demonstrate the effectiveness of the
approach.

2 The Framework 

One approach to the segmentation problem has been to try to globally minimize
what we call the “energy” of the segmentation. These energy models are usually
used in conjunction with Bayes’s theorem. Most of the time, the energy is
designed as a combination of several terms, each of them corresponding to a
precise property which must be satisfied by the optimal solution. The models
have two parts: a prior model $E_p$ and a data model $E_d$. The prior term is
sometimes called the “regularizer” because it was initially conceived to make
the problem of minimizing the data model well-posed.

2.1 Minimization Problem 

Our theoretical setting is the following. Let us consider a real-valued
function $I$, i.e. the image, whose domain is denoted $S = [0,a] \times [0,b]$.
In many situations, it is convenient to consider images as real-valued
functions of continuous variables. We define the solution to the segmentation
problem as the global minimum of a regularized criterion over all regions.

Let $s = (x,y) \in S$ be an image pixel, $\Omega_i \subset S$, $i = 1,\ldots,P$,
a non-empty image domain or object and $\partial\Omega_i$ its boundary. We
associate with the unknown domains the following regularized objective
function:

$$\begin{cases} E_\lambda(f,\Omega_1,\ldots,\Omega_P) = E_d(f,\Omega_1,\ldots,\Omega_P) + \lambda E_p(\Omega_1,\ldots,\Omega_P) \\ E_d(f,\Omega_1,\ldots,\Omega_P) = \sum_{i=1}^{P} E_d(f,\Omega_i) \end{cases} \qquad (1)$$

where $f$ is any integrable function, for instance the convolution of the image
$I$ with any filter, $E_p(\Omega_1,\ldots,\Omega_P)$ is a penalty functional
and $\lambda > 0$ is the regularization parameter. Some choices of $f$ have
recently been listed in the literature. Here, we just consider the possibility
of examining the image at various scales using a Gaussian smoothing of the
image, including the case of zero variance, i.e. $f = I$, and the case of
anisotropic diffusion.

Equation (1) is the most general form of energy we can optimize globally at
present. We present two appropriate energy models for segmentation which
attempt to capture homogeneous regions with unknown constant intensities. It
will be clear that neither of these models captures all the important scene
variables, but they may be useful to provide a rough analysis of the scene.
Least squares criterion: In this modeling, a Gaussian distribution for the
noise is implicitly assumed. The data model is usually defined as

$$E_d(f,\Omega_1,\ldots,\Omega_P) = \sum_{i=1}^{P} \int_{\Omega_i} (f(x,y) - \bar f_{\Omega_i})^2 \, dx\,dy \qquad (2)$$

where $\bar f_{\Omega_i}$ denotes the average of $f$ over $\Omega_i$. This
means that one observes a corrupted function $f = f_{\text{true}} + \varepsilon$,
where $\varepsilon$ is a zero-mean Gaussian white noise and $f_{\text{true}}$
is supposed piecewise constant, i.e.

$$f_{\text{true}}(s) = \sum_{i=1}^{P} \bar f_{\Omega_i} \, \mathbb{1}(s \in \Omega_i) \quad \text{where} \quad |\Omega_i| = \int_{\Omega_i} dx\,dy, \qquad (3)$$

and $\mathbb{1}(\cdot)$ is the indicator function. The standard deviation is
assumed to be constant over the entire image. The image domain $S$ is split
into $P$ unknown disjoint regions $\Omega_1, \ldots, \Omega_P$.
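
In a discrete setting, the least squares data model (2) reduces to a sum of
squared deviations from the region means. The following Python sketch
illustrates this for a label map; the function name `least_squares_energy` and
the list-of-lists image layout are assumptions made for the example, not part
of the original method description.

```python
from collections import defaultdict


def least_squares_energy(image, labels):
    """Discrete version of (2): sum over pixels of (f - mean of its region)^2.

    `image` and `labels` are 2D lists of the same shape; `labels[r][c]` is
    the index of the region that pixel (r, c) belongs to.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for img_row, lab_row in zip(image, labels):
        for value, label in zip(img_row, lab_row):
            sums[label] += value
            counts[label] += 1
    means = {label: sums[label] / counts[label] for label in sums}
    return sum(
        (value - means[label]) ** 2
        for img_row, lab_row in zip(image, labels)
        for value, label in zip(img_row, lab_row)
    )


image = [[1, 1, 5, 5], [1, 1, 5, 5]]
two_regions = [[0, 0, 1, 1], [0, 0, 1, 1]]
one_region = [[0, 0, 0, 0], [0, 0, 0, 0]]
energy_split = least_squares_energy(image, two_regions)   # piecewise constant fit
energy_merged = least_squares_energy(image, one_region)   # single mean for all pixels
```

A partition that follows the true piecewise constant structure drives the data
term to zero, which is why a penalty on the number of regions is needed to
avoid over-segmentation.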

Contrast statistic criterion: One may be interested in identifying boundaries
corresponding to sharp contrast in the image. We define the contrast of a
boundary by the difference between the average value of $f$ per unit area on
the inside of the boundary and the outside of the object, that is the
background $\Omega_P$ [23]. Formally, the corresponding data model is



$$E_d(f,\Omega_1,\ldots,\Omega_P) = -\sum_{i=1}^{P-1} (\bar f_{\Omega_i} - \bar f_{\Omega_P})^2 \qquad (4)$$

Regions are assumed to be simple closed curves superimposed on the background.
This data model does appear to have a fairly wide application potential,
especially in medical image analysis and confocal microscopy, where the regions
of interest appear as bright objects relative to the dark background.

For the sake of clarity, we restrict ourselves to the first case, i.e. the
least squares criterion, and give the major results for the other criterion.

Our aim is now to define objects in $f$. Therefore, we define the following
class $C_P$, $P \ge 1$, of admissible objects

$$C_P = \Big\{ (\Omega_1,\ldots,\Omega_{P-1}) \subset S \text{ are regular, closed and connected};\ \bigcup_{i=1}^{P} \Omega_i = S;\ 1 \le i \ne j \le P,\ \Omega_i \cap \Omega_j = \emptyset \Big\}$$

where the subsets $(\Omega_1,\ldots,\Omega_{P-1})$ are the objects of the image
and $\Omega_P$ is the background. When $P = 1$, there is no object in the
image. An optimal segmentation of image $f$ over $C_P$ is by definition a
global minimum of the energy (when it exists)



$$E_\lambda(f,\Omega_1^*,\ldots,\Omega_{P^*}^*) = \inf_{P \ge 1} \ \inf_{(\Omega_1,\ldots,\Omega_P) \in C_P} E_\lambda(f,\Omega_1,\ldots,\Omega_P). \qquad (5)$$
A direct minimization with respect to all unknown domains and parameters
$\bar f_{\Omega_i}$ is a very intricate problem. In the next section, we prove
that the object boundaries are level lines of the function $f$ if the penalty
function $E_p$ only encourages the emergence of a small number of regions. In
our context, an entropy prior or a Markov connected component fields-based
prior is used to reduce the number of regions. The parameter $\lambda$ can then
be interpreted as a scale parameter that only tunes the number of regions.

2.2 Minimizer Description and Level Lines 

Our estimator is defined by (when it exists)

$$(\Omega_1^*,\ldots,\Omega_{P^*}^*) = \arg\min_{P \ge 1} \ \arg\min_{(\Omega_1,\ldots,\Omega_P) \in C_P} E_\lambda(f,\Omega_1,\ldots,\Omega_P). \qquad (6)$$

The question of the existence of an admissible global minimum for energies like
Mumford and Shah’s energy is a difficult problem. Here, our aim is not to
investigate conditions for having an admissible global minimum. In what
follows, we make an ad-hoc assumption ensuring the existence of a unique
minimum of the energy.

Minimizer description. We propose the following lemma

Lemma 1 If there exists a unique admissible global minimum and no pathological
minimum exists, then the set of curves that globally minimizes the energy is a
subset of level lines of $f$:

$$f|_{\partial\Omega_i} \equiv \mu_i, \quad i = 1,\ldots,P-1,$$

i.e. the border $\partial\Omega_i$ of each $\Omega_i$ is a boundary of a level
set of $f$.

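Proof of Lemma 1 Without loss of generality, we prove Lemma 1 for one object
$\Omega$ and a background $\Omega^c$, where $\Omega^c$ denotes the closure of
the complementary set of $\Omega$. For two sets $A$ and $B$, denote
$\int_{A-B} f = \int_A f - \int_B f$. Let $\Omega_\delta$ be a small
perturbation of $\Omega$, i.e. the Hausdorff distance
$d_\infty(\Omega_\delta, \Omega) < \delta$. Then, we have

$$\underbrace{\int_{\Omega_\delta - \Omega} 1}_{\Delta(|\Omega|)} = |\Omega_\delta| - |\Omega| \quad \text{and} \quad \Big(\int_{\Omega_\delta} f\Big)^2 - \Big(\int_{\Omega} f\Big)^2 = 2 \int_{\Omega} f \int_{\Omega_\delta - \Omega} f + \Big(\int_{\Omega_\delta - \Omega} f\Big)^2 \qquad (7)$$

and the following image moments:

$$m_0 = \int_\Omega 1, \quad m_1 = \int_\Omega f, \quad m_2 = \int_\Omega f^2, \qquad K_0 = \int_S 1, \quad K_1 = \int_S f, \quad K_2 = \int_S f^2. \qquad (8)$$

The difference between the involved energies is equal to

$$\underbrace{E_\lambda(f,\Omega_\delta,\Omega_\delta^c) - E_\lambda(f,\Omega,\Omega^c)}_{\Delta E_\lambda(f,\Omega,\Omega^c)} = \underbrace{E_d(f,\Omega_\delta,\Omega_\delta^c) - E_d(f,\Omega,\Omega^c)}_{\Delta E_d(f,\Omega,\Omega^c)} + \lambda \underbrace{\big(E_p(\Omega_\delta,\Omega_\delta^c) - E_p(\Omega,\Omega^c)\big)}_{\Delta E_p(\Omega,\Omega^c)}. \qquad (9)$$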
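Table 1. Coefficients $A_0$ and $A_1$ associated to the two segmentation criteria.

Least squares criterion:

$$A_0 = \frac{m_1^2}{m_0^2} - \frac{(K_1 - m_1)^2}{(K_0 - m_0)^2}, \qquad A_1 = -\frac{2 m_1}{m_0} + \frac{2 (K_1 - m_1)}{K_0 - m_0}$$

Contrast statistic criterion:

$$A_0 = \frac{2 m_1^2}{m_0^3} - \frac{2 (K_1 - m_1)^2}{(K_0 - m_0)^3} + \frac{2 m_1 (K_1 - m_1)(2 m_0 - K_0)}{m_0^2 (K_0 - m_0)^2}, \qquad A_1 = -\frac{2 m_1}{m_0^2} + \frac{2 (K_1 - m_1)}{(K_0 - m_0)^2} + \frac{2 (K_1 - 2 m_1)}{m_0 (K_0 - m_0)}$$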

In the Appendix, it is shown that for $\Delta(|\Omega|) \to 0$,
$\Delta E_d(f,\Omega,\Omega^c)$ is equal to

$$\Delta E_d(f,\Omega,\Omega^c) = \Delta(|\Omega|)\,(A_0 + A_1 f) + O(\Delta(|\Omega|)^2) \qquad (10)$$

where $A_0$ and $A_1$ are computed from the image moments given in (8). For the
two criteria described in Section 2.1, the coefficients are listed in Table 1.
Suppose we can write $\Delta E_p(\Omega,\Omega^c)$ as

$$\Delta E_p(\Omega,\Omega^c) = \Delta(|\Omega|)\,(B_0 + B_1 f) + O(\Delta(|\Omega|)^2). \qquad (11)$$

Let $s_0$ be a fixed point of the border $\partial\Omega$. Choose
$\Omega_\delta$ such that $\partial\Omega_\delta = \partial\Omega$ except on a
small neighborhood of $s_0$. The energy having a minimum for $\Omega$,
$f(s_0)$ needs to be a solution of the following equation

$$\Delta E_\lambda(f,\Omega,\Omega^c) = \Delta(|\Omega|)\,[(A_0 + \lambda B_0) + (A_1 + \lambda B_1) f(s_0)] + O(\Delta(|\Omega|)^2) = 0. \qquad (12)$$

By pre-multiplying (12) by $\Delta(|\Omega|)^{-1}$ and passing to the limit
$\Delta(|\Omega|) \to 0$, we obtain

$$(A_0 + \lambda B_0) + (A_1 + \lambda B_1) f(s_0) = 0. \qquad (13)$$

Equation (13) has a unique solution. The coefficients $(A_0 + \lambda B_0)$ and
$(A_1 + \lambda B_1)$ depend on neither $s_0$ nor $f(s_0)$, and
$A_0 + \lambda B_0 \ne 0$. The function $f$ is continuous and $\partial\Omega$
is a connected curve. Therefore $f(s_0)$ is constant when $s_0$ covers
$\partial\Omega$. □

In conclusion, we proved that the global minimizer is a subset of iso-intensity
curves of the image provided that $E_\lambda(f,\Omega_1,\ldots,\Omega_P)$ is
explained by second-order image moments. In the next section, we list two
penalty functionals relying on the Markov connected component fields and
entropy theories, which are consistent with Lemma 1 and (11).
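
As an illustration of how (13) fixes the grey level on an object border, the
following Python sketch evaluates the least squares coefficients from the
moments in (8). It is a sketch under stated assumptions: the function names are
hypothetical, and the closed forms for A0 and A1 are re-derived here by
differentiating the least squares energy (2) for one object and a background,
so they should be checked against Table 1.

```python
def ls_energy(m0, m1, K0, K1, K2):
    """Least squares energy (2) for one object and a background, via moments."""
    return K2 - m1 ** 2 / m0 - (K1 - m1) ** 2 / (K0 - m0)


def ls_coefficients(m0, m1, K0, K1):
    """First-order coefficients (A0, A1) of the energy change in (10)."""
    object_mean = m1 / m0
    background_mean = (K1 - m1) / (K0 - m0)
    A0 = object_mean ** 2 - background_mean ** 2
    A1 = 2.0 * (background_mean - object_mean)
    return A0, A1


def boundary_level(A0, A1, lam=0.0, B0=0.0, B1=0.0):
    """Solve (13) for f(s0): (A0 + lam*B0) + (A1 + lam*B1) * f(s0) = 0."""
    return -(A0 + lam * B0) / (A1 + lam * B1)


# Object: 1000 pixels of mean 5; background: 3000 pixels of mean 1.
m0, m1, K0, K1, K2 = 1000, 5000, 4000, 8000, 40000.0
A0, A1 = ls_coefficients(m0, m1, K0, K1)   # (24.0, -8.0)
level = boundary_level(A0, A1)             # 3.0, midway between the two means

# Check (10): add one pixel (area change 1) of value 5 to the object.
exact_change = ls_energy(m0 + 1, m1 + 5, K0, K1, K2) - ls_energy(m0, m1, K0, K1, K2)
first_order = A0 + A1 * 5
```

With no penalty term, the level selected by (13) is the midpoint of the object
and background means; a non-zero $\lambda B_0$ shifts this level.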



Image representation by level sets. In consequence of Lemma 1, object
borders can be determined by boundaries of level sets. Meanwhile, it turns out
that the basic information of an image (or function $f$) is contained in the
family of its binary shadows or level sets, that is, in the family of sets
$S_\eta$ defined by

$$S_\eta = \{ s \in S : f(s) \ge \eta \} \qquad (14)$$

for all values of $\eta$ in the range of $f$. In contrast to edge
representation, the family of level sets is a complete representation of $f$.
This representation is invariant with respect to any increasing contrast change
and so is robust to changes in illumination conditions. In general, the
threshold set is made up of connected components based both on the image gray
levels and on spatial relations between pixels. To extract a connected
component of a level set $S_\eta$, we threshold the image at the gray level
$\eta$ and extract the components of the binary image we obtain. A more
efficient technique has been described in the literature. A recent variant of
this representation considers the boundaries of level sets, that is, the level
lines. This representation does not differ from the set of level sets. As a
consequence of the inclusion property of level sets, the level lines do not
cross each other. In the following, we basically consider that a connected
component is an object and the level lines are just a set of $\eta$-isovalue
pixels at the borders $\partial\Omega_i$ of connected components.
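
The thresholding and component extraction described above can be sketched in a
few lines of Python. This is a minimal illustration, not the more efficient
technique mentioned in the text: the function name `level_set_components` is
hypothetical, and 4-connectivity is an assumption.

```python
from collections import deque


def level_set_components(image, eta):
    """Label the 4-connected components of the level set {s : f(s) >= eta}.

    Returns (labels, count), where labels[r][c] is 0 outside the level set
    and a component index in 1..count inside it.
    """
    rows, cols = len(image), len(image[0])
    labels = [[0] * cols for _ in range(rows)]
    count = 0
    for r in range(rows):
        for c in range(cols):
            if image[r][c] < eta or labels[r][c]:
                continue
            count += 1
            labels[r][c] = count
            queue = deque([(r, c)])
            while queue:  # breadth-first flood fill of one component
                y, x = queue.popleft()
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and not labels[ny][nx] and image[ny][nx] >= eta):
                        labels[ny][nx] = count
                        queue.append((ny, nx))
    return labels, count


image = [[9, 0, 0],
         [0, 0, 8],
         [0, 7, 8]]
labels, count = level_set_components(image, 5)
```

Raising the threshold can only shrink the level set, which is the inclusion
property that keeps level lines from crossing.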

2.3 Forms of Prior Models 

One of the difficulties in the Bayesian approach is to assign the prior law to
reflect our prior knowledge about the solution. Besides, in consequence of
Lemma 1, the set of penalty functionals is limited. The contribution of a given
pixel to the prior does not depend on the relation with neighbors, and the
resulting regions may have noisy boundaries. Here, the proposed penalty
functionals are not necessarily convex but only serve to select the right
number of regions. Instead of fixing a priori the cardinality of the
segmentation, which is a highly arbitrary choice, it seems more natural to
control the emergence of regions by an object area-based penalty or by an
information criterion weighted by a scale parameter $\lambda$.



Markov connected component fields. A new class of Gibbsian models with
potentials associated to the connected components or homogeneous parts has
been introduced recently. For these models, the neighborhood of a pixel is not
fixed as for Markov random fields, but given by the components which are
adjacent to the pixel. These models are especially applicable for images where
relatively few gray levels occur, and where some prior knowledge is available
about size and shape characteristics for the connected components [22]. The
Markov connected component fields possess certain appealing Markov properties
which have been established in the same line of work.

Here we consider a Markov connected component field whose probability density
function is proportional to

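$$\exp\Big\{ -\underbrace{\Big( \alpha \sum_{i=1}^{P-1} |\Omega_i| + \beta (P-1)^{\zeta} + \gamma \sum_{i=1}^{P-1} |\Omega_i|^2 \Big)}_{E_p(\Omega_1,\ldots,\Omega_P)} \Big\} \qquad (15)$$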

The parameter $\gamma$ controls the size of the components since the squared
area of the union of two components is greater than the sum of the squared
areas of






C. Kervrann, M. Hoebeke, and A. Trubuil 



each component. The size of the components is however also influenced by the
parameter $\alpha$ together with the parameters $\beta$ and $\zeta$, which
control the number of components. The potential $E_p(\Omega_1,\ldots,\Omega_P)$
is the most general functional we can use since the boundaries of connected
components cannot be penalized in our framework. These potentials can be used
separately to select the right number of regions by setting
$\alpha, \beta, \gamma \in \{0, 1\}$.

In Section 2.2, we proved Lemma 1 for one object $\Omega$ and a background
$\Omega^c$. Using the same notations, we easily write

$$\Delta E_p(\Omega,\Omega^c) = \alpha \Delta(|\Omega|) + \gamma (|\Omega_\delta|^2 - |\Omega|^2) = \Delta(|\Omega|)\,(\alpha + 2\gamma m_0) + \gamma \Delta(|\Omega|)^2 \qquad (16)$$

Accordingly, we obtain $B_0 = (\alpha + 2\gamma m_0)$ and $B_1 = 0$, which is
consistent with Lemma 1 and (11) if no pathological event (e.g. a topological
change) occurs.

The application of Markov connected component fields is somewhat more
computationally demanding than the application of Markov random fields. By
the local Markov property, the calculations for an update of a site in a single
site updating algorithm only involve the components adjacent to this site. Our
work may be regarded as a preliminary exploitation, in image segmentation, of
the theoretical framework described by Møller et al.



Entropy prior. The entropy function has been widely used as a prior in a
Bayesian context for image restoration. Here, the entropy of the segmented
image is written as follows

$$E_p(\Omega_1,\ldots,\Omega_P) = -\sum_{i=1}^{P} p_i \ln p_i = -\sum_{i=1}^{P} \frac{|\Omega_i|}{|S|} \ln \frac{|\Omega_i|}{|S|} \qquad (17)$$

where the $p_i$ represent the histogram values, $|\Omega_i|$ the cardinality of
region $\Omega_i$ and $|S|$ the cardinality of the image domain. The value
$p_i$ is the number of occurrences of the gray level value $\bar f_{\Omega_i}$
in the segmented image. The histogram entropy is minimized for a Dirac
distribution corresponding to one single class in the segmented image. In image
segmentation, we want to obtain a histogram sharper than the histogram of the
initial image, so the entropy should be minimized. The actual reduction of the
number of classes is obtained from the information prior
$E_p(\Omega_1,\ldots,\Omega_P)$. Using the notations introduced in Section 2.2,
we write

$$\Delta E_p(\Omega,\Omega^c) = \frac{\Delta(|\Omega|)}{|S|} \ln \frac{|S| - |\Omega|}{|\Omega|} + O(\Delta(|\Omega|)^2). \qquad (18)$$

Accordingly, we obtain $B_0 = \frac{1}{|S|} \ln \frac{|S| - |\Omega|}{|\Omega|}$
and $B_1 = 0$, which is consistent with Lemma 1 and (11).
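
The entropy prior and its first-order variation can be checked numerically.
The Python sketch below is illustrative only: `entropy_prior` and
`entropy_slope` are hypothetical names, and the slope formula is the
coefficient $B_0$ obtained by differentiating the two-region entropy with
respect to the object area.

```python
import math


def entropy_prior(region_sizes):
    """Entropy of a segmentation given the pixel count of each region."""
    total = sum(region_sizes)
    return -sum((n / total) * math.log(n / total) for n in region_sizes if n > 0)


def entropy_slope(object_area, image_area):
    """B0 for one object and a background: (1/|S|) ln((|S| - |O|) / |O|)."""
    return math.log((image_area - object_area) / object_area) / image_area


single_class = entropy_prior([1000])        # Dirac histogram: zero entropy
two_classes = entropy_prior([500, 500])     # equals ln 2

# Growing a 200-pixel object by one pixel in a 1000-pixel image.
exact_change = entropy_prior([201, 799]) - entropy_prior([200, 800])
first_order = entropy_slope(200, 1000)
```

The slope is positive while the object covers less than half of the image, so
in that regime the entropy prior penalizes growth of the smaller class.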

2.4 Properties of the Energy Models 

In this section, we complete the analysis of energy models and discuss the connec- 
tions with image partitioning algorithms. 
Upper bound of the objects number. It appears, most of the time, that
variations in the value of the parameter $\lambda$ have significant effects on
the qualitative properties of the minimizer [28]. We show that the maximum
number of objects is explicitly influenced by $\lambda$.

Lemma 2 If there exists an optimal segmentation as defined above, then the
optimal number $P^*$ of objects is upper bounded by

$$P_{\max} = 1 + (\lambda\,|\Omega_{\min}|)^{-1} \int_S (f(x,y) - \bar f_S)^2 \, dx\,dy$$

if $E_p(\Omega_1,\ldots,\Omega_P) = \sum_{i=1}^{P-1} |\Omega_i|$, i.e.
$\alpha = 1$, $\beta = \gamma = 0$.

Proof of Lemma 2:

$$\lambda \sum_{i=1}^{P^*-1} |\Omega_i^*| \le E_\lambda(f,\Omega_1^*,\ldots,\Omega_{P^*}^*) \le E_\lambda(f,S) = \int_S (f(x,y) - \bar f_S)^2 \, dx\,dy.$$

If $|\Omega_i^*| \ge |\Omega_{\min}|$, we have

$$(P^* - 1)\,|\Omega_{\min}| \le \sum_{i=1}^{P^*-1} |\Omega_i^*| \le \lambda^{-1} \int_S (f(x,y) - \bar f_S)^2 \, dx\,dy$$

and

$$P^* \le 1 + (\lambda\,|\Omega_{\min}|)^{-1} \int_S (f(x,y) - \bar f_S)^2 \, dx\,dy.$$

□
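
In a discrete setting the bound of Lemma 2 is cheap to evaluate, since the
integral becomes a sum of squared deviations from the global image mean. The
Python sketch below is an illustration with a hypothetical function name
(`max_objects`); it assumes the area penalty $E_p = \sum_i |\Omega_i|$ used in
the lemma.

```python
def max_objects(image, lam, min_area):
    """Upper bound P_max = 1 + (lam * min_area)^-1 * sum (f - mean(f))^2."""
    if lam <= 0 or min_area <= 0:
        raise ValueError("lam and min_area must be positive")
    pixels = [value for row in image for value in row]
    mean = sum(pixels) / len(pixels)
    total_squared_deviation = sum((value - mean) ** 2 for value in pixels)
    return 1 + total_squared_deviation / (lam * min_area)


image = [[1, 1, 5, 5], [1, 1, 5, 5]]
bound = max_objects(image, lam=2.0, min_area=4)   # 1 + 32 / 8
```

Increasing either the scale parameter or the minimum admissible area tightens
the bound, which matches the role of $\lambda$ as the parameter that tunes the
number of regions.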



Connection to snakes and geodesic active contour models. Let
$v_i(s) = (x_i(s), y_i(s))$ denote a point on the common boundary
$\partial\Omega_i$ (parametrized by $s \in [0, 1]$) of a region $\Omega_i$ and
the background $\Omega_P$. We suppose

$$E_\lambda(f,\Omega_1,\ldots,\Omega_P) = \sum_{i=1}^{P} \int_{\Omega_i} (f(x,y) - \bar f_{\Omega_i})^2 \, dx\,dy + \lambda \sum_{i=1}^{P-1} |\Omega_i|.$$

The time $t$ dependent position of the boundary $\partial\Omega_i$ can be
expressed parametrically by $v_i(s,t)$. The motion of the boundary
$\partial\Omega_i$ is governed by the Euler-Lagrange differential equation. For
any point $v_i(s,t)$ on the boundary $\partial\Omega_i$ we obtain:



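$$\frac{\partial v_i(s,t)}{\partial t} = -\frac{\delta E_\lambda(f,\Omega_i,\Omega_P)}{\delta v_i(s)} = \big[ (f(x,y) - \bar f_{\Omega_i})^2 + \lambda - (f(x,y) - \bar f_{\Omega_P})^2 \big] \, n(v_i(s)) \qquad (19)$$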



where $n(v_i(s))$ is the unit normal to $\partial\Omega_i$ at point
$v_i(s,t)$. This equation can be seen as a degenerate case of the region
competition algorithm described by Zhu et al. [28], where $\lambda$ is
analogous to a pressure term. Solving the Euler-Lagrange equations for each
region can be complex, and the region competition algorithm (see [28]) finds a
local minimum. Using the level-set formulation, suitable numerical schemes have
been derived for solving propagating equations. However, in both cases, seed
regions must be provided by the user or randomly put across the image, and the
mean values $\bar f_{\Omega_i}$ are updated at each step of the iterative
algorithm. In this paper, we directly determine the steady solutions
associated with the motion equation given in (19).
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3 Segmentation Algorithm 

In practical imaging, both the domain $S$ and the range of $f$ are discrete sets.
The segmentation algorithm we propose is automatic and requires neither the
number of regions nor any initial value for the regions. This algorithm is not a region
growing algorithm as described in since all objects are built once and 

for all according to da. Energy minimization is performed once all admissible 
objects have been registered. To implement our level set image segmentation, a 
four step method is used. 

Level set construction The first step completes a crude mapping of each
image pixel onto a given level set. At present, we uniformly quantize the function
$f \in [f_{\min}, f_{\max}]$ into $K \in \{4, 8, 16, 32\}$ equal-sized and
non-overlapping intervals $\{[l_1, h_1[, \ldots, [l_K, h_K]\}$. Given this set of
levels, we then assign one of the levels to each pixel $s$: $s$ is assigned to
$[l_j, h_j[$ if $l_j \leq f(s) < h_j$.
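The quantization step can be sketched as follows. This is a minimal illustration, assuming the image is a list of rows of grey levels; the function name `quantize` is hypothetical, and the maximum value is placed in the last interval because that interval is closed on the right.

```python
def quantize(image, K):
    """Map each pixel to the index j (0-based) of the interval [l_j, h_j[ that
    contains it, for K equal-sized intervals covering [f_min, f_max]."""
    f_min = min(min(row) for row in image)
    f_max = max(max(row) for row in image)
    # Guard against a constant image, for which every pixel gets level 0.
    width = (f_max - f_min) / K if f_max > f_min else 1.0
    levels = []
    for row in image:
        # The last interval is closed, so f_max is clamped into level K - 1.
        levels.append([min(int((v - f_min) / width), K - 1) for v in row])
    return levels


levels = quantize([[0.0, 0.25, 0.5, 1.0]], K=4)
```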

Object extraction A crude way to build the pixel sets corresponding to objects
is to perform a connected components labeling and to associate each label
with an object $\Omega_i$. The background $\Omega_P$ corresponds to the
complementary set of the objects. The list of connected components of each of
the level sets then forms the list of objects $\{\Omega_1, \ldots, \Omega_T\}$,
where $T$ is the maximum number of connected components such that
$|\Omega_i| \geq |\Omega_{\min}|$ and $P \leq T \leq P_{\max}$.

Though this process may work in the noise-free case, in general we would
also need some smoothing of the connected components labeling. So we
consider a size-oriented morphological operator acting on sets that consists in
keeping all connected components of the output whose area is larger than a
limit $|\Omega_{\min}|$.
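A possible realization of the object extraction with the size filter is sketched below. It assumes that an object is a 4-connected set of pixels sharing the same quantization level; the connectivity, the function name `extract_objects`, and the list-of-coordinates representation are choices made for this example and are not specified in the text.

```python
from collections import deque


def extract_objects(levels, min_area):
    """Return the connected components (lists of (row, col) pixels) of equal
    quantization level whose area is at least min_area."""
    rows, cols = len(levels), len(levels[0])
    seen = [[False] * cols for _ in range(rows)]
    objects = []
    for r in range(rows):
        for c in range(cols):
            if seen[r][c]:
                continue
            # Breadth-first flood fill from the first unvisited pixel.
            seen[r][c] = True
            queue, component = deque([(r, c)]), []
            while queue:
                y, x = queue.popleft()
                component.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and not seen[ny][nx]
                            and levels[ny][nx] == levels[r][c]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            # Size-oriented filtering: small components are discarded.
            if len(component) >= min_area:
                objects.append(component)
    return objects


objects = extract_objects([[0, 0, 1], [0, 1, 1], [2, 1, 1]], min_area=2)
```

In this small example the isolated pixel of level 2 is suppressed by the area threshold, while the two larger components are kept.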

Configuration determination The connected components are then combined
during the third step to form object configurations. Having the object list
$\{\Omega_1, \ldots, \Omega_T\}$, configurations can be built by enumeration of
all possible object combinations, i.e. $2^T$ configurations. Each possible
configuration can then be represented by a binary number $b_i$ which is the
binary expansion of $i$ ($0 \leq i \leq 2^T - 1$). The binary value of each bit in
$b_i$ determines the presence or absence of a given object in the configuration.

Energy computation Each configuration represents a set of objects which 
in turn is a set of pixels. Energy calculations take the image intensities of the 
original (not quantized) image at each of these pixels to establish mean and 
approximation error. Note that energies corresponding to each object are com- 
puted once and stored, and energy corresponding to the background is efficiently 
updated for each configuration. The configuration that globally minimizes the 
energy functional corresponds to the optimal segmentation. The time necessary 
to perform image segmentation essentially depends on the length of the object 
list, i.e. the number T of connected components. Nevertheless, all configurations 
are independent and could be potentially evaluated on suitable parallel archi- 
tectures. 
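The configuration enumeration and energy computation described above can be sketched as follows, under stated assumptions: the data term is the least squares error of each region around its own mean, the prior is $\lambda \sum_i |\Omega_i|$, objects are disjoint lists of pixel coordinates, and the names `region_error` and `best_configuration` are hypothetical. Per-object statistics are computed once and the background statistics are obtained by subtraction, in the spirit of the update mentioned in the text.

```python
def region_error(n, s, ss):
    """Sum of squared deviations from the mean, from count, sum, sum of squares."""
    return ss - s * s / n if n > 0 else 0.0


def best_configuration(image, objects, lam):
    """Enumerate the 2^T configurations and return (energy, config), where bit k
    of config tells whether objects[k] is present."""
    pixels = [v for row in image for v in row]
    total = (len(pixels), sum(pixels), sum(v * v for v in pixels))
    # Object statistics are computed once and stored.
    stats = []
    for obj in objects:
        values = [image[y][x] for y, x in obj]
        stats.append((len(values), sum(values), sum(v * v for v in values)))
    best = None
    for config in range(2 ** len(objects)):
        n_bg, s_bg, ss_bg = total
        energy = 0.0
        for k, (n, s, ss) in enumerate(stats):
            if (config >> k) & 1:
                energy += region_error(n, s, ss) + lam * n
                # Background statistics are updated by removing the object.
                n_bg, s_bg, ss_bg = n_bg - n, s_bg - s, ss_bg - ss
        energy += region_error(n_bg, s_bg, ss_bg)
        if best is None or energy < best[0]:
            best = (energy, config)
    return best


image = [[0.0, 0.0, 1.0, 1.0]]
bright = [(0, 2), (0, 3)]
energy, config = best_configuration(image, [bright], lam=0.1)
```

Since the loop visits $2^T$ configurations, the sketch is only practical for short object lists, which is consistent with the remark that the running time depends essentially on $T$.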






4 Experimental Results 

We are interested in the use of the technique in the context of medical and aerial 
imagery and confocal microscopy. Our system successfully segmented various 
images into a few regions. For the bulk of the experiments, we used a slightly 
restricted form, in which the data model is given by 0 and, for the sake of 
clarity, we restrict ourselves to using a single potential at a time,
i.e. $E_p(\Omega_1, \ldots, \Omega_P) = P - 1$ or $E_p(\Omega_1, \ldots, \Omega_P) = \sum_{i=1}^{P-1} |\Omega_i|$. The last prior

model can be re-defined to find large regions with low/high intensity in the 
image (see Figs.|^-0. Similar results were obtained using an entropy prior. The 
algorithm parameters were set as follows: $K$ = 4, 8, 16 or 32, and regions
whose areas satisfy $|\Omega_i| < 0.01 \times |S|$ are discarded. Most
segmentations took approximately 4-10 seconds on a 296MHz workstation. Two sets
of simulations were conducted on synthetic as well as real-world images to
evaluate the performance of the algorithm. In the experiments, the image
intensities have been normalized into the range [0, 1].

Figure 1a shows an artificially computed 256 × 256 image representing the
superposition of two bidimensional Gaussian functions located respectively at
$s_0 = (64, 128)$ and $s_1 = (160, 128)$ with variances $\sigma_0 = 792$ and
$\sigma_1 = 1024$. Figure 1b shows the result of the uniform quantization
operation applied to Fig. 1a ($K = 32$). The level lines associated with the
quantized image are displayed in Fig. 1c. Note that level sets whose area is too
small are suppressed. Figures 1d-f show how the penalization parameter
$\lambda$ influences the segmentation results when
$E_p(\Omega_1, \ldots, \Omega_P) = \sum_{i=1}^{P-1} |\Omega_i|$. The white
borders denote the boundaries of the objects resulting from the segmentation.

We have applied the same algorithm to a 256 × 256 aerial image depicting
the region of Saint-Louis during the rising of the Mississippi and Missouri rivers in
July 1993 (Fig. 2). We are interested in extracting dark regions labelled using
ff 2 in this image. The level lines corresponding to K = 8 are shown on Fig.0D. 
The approach has successfully extracted significant dark regions and labeled
urban areas, forests and fields in “white” as “background” (Fig. 2, right).

An example in 2D medical imaging is shown in Fig. 3. Figure 3 shows
the results of the above method when applied to outline the endocardium of
a heart image obtained using Magnetic Resonance. This figure illustrates how
our method selects the number of segments in a 2D medical MR image (179 ×
175 image). The level lines are shown in Fig. 3 (middle) and the region of
interest is successfully located using $K = 8$ and $\lambda = 0.01$.

Confocal systems offer the chance to image thick biological tissue in 2D+t
or 3D. They operate in the bright-field and fluorescence modes, allowing
the formation of high-resolution images with a depth of focus sufficiently
small that all the detail which is imaged appears in focus and the out-of-focus
information is rejected. Some of the current applications in biological studies
are in neuron research. We have tested the proposed algorithm on 2D confocal
microscopy 256 × 240 images (Fig. 4), courtesy of INSERM 413 IFRMP
n°23 (Rouen, France). Figure 4 depicts a triangular cell named “astrocyte”.
These cells generally take the place of dead neurons. In Fig. 4, the
segmentation of one single cell is shown. We first filtered the image
using anisotropic diffusion. The boundaries of the cell components are quite
accurately delineated in Fig. 4 ($K = 8$, $\lambda = 0.001$).

Fig. 1. Segmentation results of a synthetic image. a) original image; b) uniformly
quantized image ($K = 32$); c) level lines superimposed on the quantized image; d)
segmentation with $\lambda = 0.01$; e) segmentation with $\lambda = 0.1$ - two
detected objects; f) segmentation with $\lambda = 1.0$ - one detected object.

5 Conclusion and Perspectives 

In this paper we have proposed basic energy functionals for the segmentation 
of regions in images, and we proved that the minimizer of our energy models 
can be explicitly determined. The minimization requires no initialization, and 
is highly parallelizable. A total CPU time of a few seconds for segmenting a 
256 X 256 image on a workstation makes the method attractive for many time- 
critical applications. The contribution of this approach has been illustrated on 
synthetic as well as real-world images. The energies are of a very general form and 
always globally optimizable by the same algorithm. The framework offers many 
other possibilities for further modeling. We are currently studying an adaptive 
quantization technique instead of the uniform quantization used at present to 
estimate the objects. Finally, the extension of the approach to volumetric images 
(confocal microscopy) and multi-spectral images is also of interest. In this setting,
the structure of the algorithm would be largely the same, although there are a
number of points which would need to be examined closely.

Fig. 2. Segmentation results of an aerial image ($\lambda = 0.001$). Left: original image.
Middle: level lines computed from the quantized image ($K = 8$). Right: label map.





Fig. 3. Segmentation results of an MR image ($\lambda = 0.01$). Left: original image. Middle:
level lines computed from the quantized image ($K = 8$). Right: boundaries of the object
of interest superimposed on the original image.





Fig. 4. Segmentation in 2D confocal microscopy ($\lambda = 0.001$). Left: original image.
Middle: boundaries superimposed on an adaptively filtered image ($K = 8$). Right: label

A Computation of the Energy Variation for the Least Squares Criterion

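We compute the energy variation for one object $\Omega$ and a background
$\Omega^c$, where $\Omega^c$ denotes the closure of the complementary set of
$\Omega$. The data model is

$$E_d(f, \Omega, \Omega^c) = \int_{\Omega} (f(x,y) - f_{\Omega})^2\,dx\,dy + \int_{\Omega^c} (f(x,y) - f_{\Omega^c})^2\,dx\,dy. \tag{20}$$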



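For two sets $A$ and $B$, denote $\int_{A-B} f = \int_A f - \int_B f$. Let
$\Omega_\delta$ be a small perturbation of $\Omega$, i.e. the Hausdorff distance
$d_\infty(\Omega_\delta, \Omega) \leq \delta$. Then, we define

$$\Delta(|\Omega|) = \int_{\Omega_\delta - \Omega} 1 = |\Omega_\delta| - |\Omega| \quad \text{and} \quad \left(\int_{\Omega_\delta} f\right)^2 - \left(\int_{\Omega} f\right)^2 = 2 \int_{\Omega} f \int_{\Omega_\delta - \Omega} f + \left(\int_{\Omega_\delta - \Omega} f\right)^2. \tag{21}$$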



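The difference between the involved energies is equal to
$\Delta E_d(f, \Omega, \Omega^c) = E_d(f, \Omega_\delta, \Omega_\delta^c) - E_d(f, \Omega, \Omega^c) = T_1 + T_2 + T_3 + T_4$, with

$$T_1 = \int_{\Omega_\delta} f^2 - \int_{\Omega} f^2, \qquad T_2 = -\frac{1}{|\Omega_\delta|}\left(\int_{\Omega_\delta} f\right)^2 + \frac{1}{|\Omega|}\left(\int_{\Omega} f\right)^2,$$

$$T_3 = \int_{S - \Omega_\delta} f^2 - \int_{S - \Omega} f^2, \qquad T_4 = -\frac{1}{|S| - |\Omega_\delta|}\left(\int_{S - \Omega_\delta} f\right)^2 + \frac{1}{|S| - |\Omega|}\left(\int_{S - \Omega} f\right)^2.$$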

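Using (21) and passing to the limit $\Delta(|\Omega|) \to 0$, i.e.
$|\Omega_\delta| \approx |\Omega|$, we obtain (higher order terms are neglected)

$$T_1 = -T_3 = \int_{\Omega_\delta - \Omega} f^2,$$

$$T_2 = -\frac{2}{|\Omega|} \int_{\Omega} f \int_{\Omega_\delta - \Omega} f + \frac{1}{|\Omega|^2}\left(\int_{\Omega} f\right)^2 \int_{\Omega_\delta - \Omega} 1 - \frac{1}{|\Omega|}\left(\int_{\Omega_\delta - \Omega} f\right)^2,$$

$$T_4 = \frac{2}{|S| - |\Omega|} \int_{S - \Omega} f \int_{\Omega_\delta - \Omega} f - \frac{1}{(|S| - |\Omega|)^2}\left(\int_{S - \Omega} f\right)^2 \int_{\Omega_\delta - \Omega} 1 - \frac{1}{|S| - |\Omega|}\left(\int_{\Omega_\delta - \Omega} f\right)^2. \tag{22}$$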



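Define the image moments $m_0 = \int_{\Omega} 1$, $m_1 = \int_{\Omega} f$,
$K_0 = \int_S 1$, $K_1 = \int_S f$. Using the mean value theorem for double
integrals, which states that if $f$ is continuous and $A$ is bounded by a simple
curve, then for some point $s_0$ in $A$ we have $\int_A f = f(s_0)\,|A|$ where
$|A|$ denotes the area of $A$, it follows that

$$\Delta E_d(f, \Omega, \Omega^c) = \left[\frac{m_1^2}{m_0^2} - \frac{(K_1 - m_1)^2}{(K_0 - m_0)^2} + \left(-\frac{2 m_1}{m_0} + \frac{2 (K_1 - m_1)}{K_0 - m_0}\right) f(s_0)\right] \int_{\Omega_\delta - \Omega} 1 \;-\; \left(\frac{1}{m_0} + \frac{1}{K_0 - m_0}\right) \left(\int_{\Omega_\delta - \Omega} 1\right)^2 f(s_0)^2. \tag{23}$$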
Abstract. The calculation of salient structures is one of the early and
basic ideas of perceptual organization in Computer Vision. Saliency
algorithms aim to find image curves, maximizing some deterministic quality
measure which grows with the length of the curve, its smoothness, and
its continuity. This note proposes a modified saliency estimation
mechanism, which is based on probabilistically specified grouping cues and
on length estimation. In the context of the proposed method, the
well-known saliency mechanism, proposed by Shaashua and Ullman [SU88],
may be interpreted as a process trying to detect the curve with maximal
expected length.

The new characterization of saliency using probabilistic cues is concep- 
tually built on considering the curve starting at a feature point, and 
estimating the distribution of the length of this curve, iteratively. Dif- 
ferent saliencies, like the expected length, may be specified as different 
functions of this distribution. There is no need however to actually pro- 
pagate the distributions during the iterative process. 

The proposed saliency characterization is associated with several
advantages: First, unlike previous approaches, the search for the “best group”
is based on a probabilistic characterization, which may be derived and
verified from typical images, rather than on pre-conceived opinion about
the nature of figure subsets. Therefore, it is expected also to be more
reliable. Second, the probabilistic saliency is more abstract and thus more
generic than the common geometric formulations. Therefore, it lends
itself to different realizations of saliencies based on different cues, in a
systematic rigorous way. To demonstrate that, we created, as an instance
of the general approach, a saliency process which is based on grey
level similarity but still preserves a similar meaning. Finally, the proposed
approach gives another interpretation for the measure that makes one
curve a winner, which may often be more intuitive to grasp, especially
as the saliency levels have a clear meaning of, say, expected curve length.



1 Introduction 

The human visual system (HVS) is capable of filtering images and finding the
important visual events so that its limited computational resources may be focused
on them and used efficiently. This discrimination between the important parts
of the image, denoted “figure”, and the less important parts, denoted
“background”, is done before the objects in the image are identified, and using general
rules (or cues) indicating what is likely to be important [Wer50].

Presented, for example, with a binary image containing points and/or curves 
(such as those resulting from edge detection), it turns out that this perceptual 
process prefers to choose for figure, a subset of points lying on some long, smooth 
and dense curve. 

To account for this phenomenon with a computational theory [Mar82],
Shaashua and Ullman suggested a particular measure, denoted saliency, that is a
particular quantification of the desirable smoothness and length properties. They have
shown that indeed, the image subsets, associated with high saliency are those 
considered as more important by common human subjective judgment [FTTMj . 
One important advantage of this computational theory is that this global opti- 
mization may be formulated as a dynamic programming task and consequently 
may be carried out as an iterative process running on a network of simple pro- 
cessors getting only local information. This makes the theory attractive because 
the proposed process is consistent with common neural mechanisms. 

The saliency measure of [SU88] was re-analyzed recently as well, revealing
some deficiencies. A generalization, stating that every saliency measure which
satisfies some conditions set in [SU88] can be optimized in the same way, was
suggested in |A M hSj . Other measures of saliency, based on non-iterative local 
support jOJVlhdj . eigenvectors of an affinity matrix |SH98j and stochastic models 
for particle motion fW.lhKj were suggested as well. A survey on different saliency 
methods is described in FTMj . The aim of the work on saliency remains to 
explain perceptual phenomena such as Figure from Ground abilities and illusory 
contours perception, but also to provide a computer vision tool for intermediate 
level sorting and filtering of the image data. Work on Figure from Ground discri- 
mination such as |HH93IHvdH93MLM| . do not address explicitly the saliency 
issue but, implicitly, calculate a (binary) saliency as well. 

Having its origin in an attempt to explain a perceptual phenomenon, most
of the work on saliency does not emphasize the justification for the HVS
preference of long smooth curves. It just tries to find a computational mechanism
that produces such a preference. Note that the particular saliency measure
proposed in [SU88] is one particular quantification of the intuitively phrased desired
properties. It may be (slightly) modified (by, say, replacing the curvature value
by twice its value), yielding a measure which is as plausible and computationally
efficient, but leading to a different choice of the most salient curve.

For perceptual modeling, the “best” saliency measure may be decided by 
psychophysical experimentation. For computer vision applications, however, op- 
timizing the saliency measure requires first to agree on a quantitative criterion. 
The initial motivation of this work is to provide an interpretation and another 
justification of the original saliency concept. 

We show here how saliency-like measures may be derived within a more
general framework, namely the quantification of grouping reliability using
probabilities. The method is conceptually built on considering the curve starting at
a feature point, and estimating the distribution of the length of this curve,
iteratively. Different saliencies, like the expected length, may be specified as different
functions of this distribution. Although central to the explanation of the method,
there is no need, in practice, to propagate the actual distribution during the
iterative process, which indeed would have required a substantial computational
effort.

The proposed view and corresponding algorithm are different from those
considered in [SU88] (for example, with regard to the treatment of virtual
(non-feature) points), but they share the iterative, dynamic-programming-like algorithm.
When phrased in terms of our algorithm, the original saliency of [SU88]
corresponds to a curvature/distance based grouping cue. Maximizing it at a point
corresponds to maximizing the expected length of the curve on which this point
lies. This way, the traditional saliency measure gets a different interpretation, of
looking for objects associated with maximal expected perimeter.

The new characterization of saliency using probabilistic cues is associated 
with the following advantages: 

1. reliability — Basing the search for the “best group” on the probabilistic 
characterization, which may be derived from typical images, (using ground 
truth), rather than on pre-conceived opinion about the nature of figure sub- 
sets is expected to give better choices of significant groups. 

2. generality — the probabilistic saliency is more abstract and thus more ge- 
neric than the original geometric formulation. Therefore, it lends itself to 
different realizations of saliencies based on different cues. To demonstrate 
that, we run the same saliency method with two different cues: low curva- 
ture and grey level similarity. 

3. another perspective - considering the SU saliency not only by its original
curvature based interpretation, but also by its probabilistic interpretation,
gives another interpretation for the measure that makes one curve a winner,
and may often be more intuitive to grasp, especially as the saliency levels
have a clear meaning of, say, expected curve length.

The paper continues as follows. First, in section 2 we present the length
distribution concept, and show how different saliency measures may be built upon
it. The proposed saliency process is described in section 3, where we consider
the iterative calculation, some shortcuts allowing not to calculate or to keep the 
actual distribution, convergence issues, and the formulation of the SU saliency as 
an instance of the new saliency. Some experiments, demonstrating the different 
types of saliencies resulting from the proposed saliency algorithm, are described 
in section E] 

2 Probabilistic Saliency 

2.1 Length Distributions 

Let $x_i$ be a directional feature point in the image (e.g. an edgel). Such a point
may or may not belong to some curve which extends $l_+$ length units to one
side and $l_-$ length units to the other side. Here we consider these lengths as
random variables associated with the feature $x_i$, and characterize them by the
distributions $D_+^i(l)$ and $D_-^i(l)$, respectively. The meaning of treating the lengths
as random variables is discussed below. The direction, used to keep the order in
the curve, is specified relative to, say, the direction of the gradient at this feature
point, and may take one of the two $\{+, -\}$ values. The parts of the curve lying in
the positive and negative directions are denoted positive and negative extensions,
respectively. Our basic intuition is that points with long extensions correspond
to larger objects and deliver more significant information about the content of
the image. Therefore we shall try to find those feature points associated with
the $D_+^i(l)$ and $D_-^i(l)$ distributions which put more weight on longer $l$ values.

When no connectivity information is available, no feature is known
to belong to any curve. Then, all distributions are concentrated on very short
lengths, corresponding to the lengths of the features themselves.
For simplicity, we assume that all these initial distributions are identical and
denote this initial distribution by $D^*(l)$.

2.2 Length Distribution Update Rules 

Consider two features, $x_i$ and $x_j$, which belong to some curve, such that $x_j$ lies
in the positive extension of $x_i$. Suppose that $D_+^j(l)$ is known. Then, $D_+^i(l)$ can
be written as

$$D_+^i(l) = D_+^{i|j}(l) = D_+^j(l - l_{ij}) \tag{1}$$

where $l_{ij}$ is the distance from $x_i$ to $x_j$ (on the curve). This follows by observing
that a positive extension of length $l$ associated with $x_j$ implies that the length of
the positive extension of $x_i$ is $l + l_{ij}$. The notation $D_+^{i|j}(l)$ explicitly emphasizes
that this is an inference of the length distribution associated with the $i$-th
feature from the known distribution associated with the $j$-th feature.

In the common situation in image analysis, we can never be sure that two
features lie on the same curve. In a non-model-based context, we can only
estimate the probability for this event based on local information such as perceptual
organization cues [Low85]. Let $c(x_j)$ denote the curve on which $x_j$ lies and let
$P_{ij}$ be the probability $\mathrm{Prob}\{x_i \in c(x_j)\}$. This probability, referred to as “the
grouping cue”, is expected to be inferred from perceptual information. Specifying
the affinity value between the two feature points $x_i$ and $x_j$ in this abstract
probabilistic way allows us to calculate a saliency-like measure based on different
grouping cues, and not only on the co-circularity cue used in [SU88]. As we shall
see, this probabilistic formulation provides a common meaning for the different
saliencies associated with the different cues, independently of the different types
of information they employ.

The cue value $\mathrm{Prob}\{x_i \in c(x_j)\}$ may be regarded as a characteristic of a
binary random variable determining whether $x_i$ and $x_j$ are connected. Consider
a path $\Gamma = \{x_1, x_2, \ldots, x_N\}$ starting at the feature point $x_1$, such that $x_{i+1}$ is
on the positive extension of $x_i$, $i = 1, \ldots, N - 1$. The length of the connected
path which starts at $x_1$ depends on the outcome of all binary random variables
characterizing the connectedness of the pairs $x_i, x_{i+1}$. Therefore, this length
may be considered as a random variable itself. The distribution of this random
variable characterizes the support that $x_1$ gets from its positive extension.

Consider now an algorithm trying to find a path between the feature points,
and some hypothesis about a particular path, in which the feature $x_j$ lies on the
positive extension of the feature $x_i$. If the length distribution $D_+^j(l)$ is known
then the expected value of the length distribution $D_+^i(l)$, written here as
$\bar{D}_+^{i|j}(l)$, is

$$\bar{D}_+^{i|j}(l) = P_{ij}\,D_+^j(l - l_{ij}) + (1 - P_{ij})\,D^*(l) \tag{2}$$

Note that $\bar{D}_+^{i|j}(l)$ is not the expected length of the positive extension
(which is a scalar), but rather, an expected distribution (out of the simple
distribution of length distributions specified by the random connection between $x_i$
and $x_j$), which is a distribution itself. Note also that this is an estimate of the
length distribution of the positive extension of $x_i$, under a particular
hypothesis regarding the path. The possibility that $x_i$ belongs to some other path (or
curve) which does not contain Xj is not taken into account. Therefore the only 
options for Xi are either to be connected to this curve or to be disconnected from 
anything (in the positive direction). An alternative formulation, where all curves 
to which Xi may belongs are taken into account, leads to a Bayesian estimate of 
D*|_(Z). See section El for a discussion of this alternative and its relation to the 
saliency like approach developed in |W.T96j . 

Suppose now that a path $\Gamma = \{x_1, x_2, \ldots, x_N\}$ starts at the feature point
$x_1$, such that $x_{i+1}$ is on the positive extension of $x_i$, $i = 1, \ldots, N - 1$. Then, the
length distribution associated with $x_1$ may be recursively calculated:
$D_+^N(l) = D^*(l)$, then $D_+^{N-1}(l)$ is obtained from $D_+^N(l)$ by eq. (2), and so on,
until $D_+^1(l)$ is finally estimated. A distribution estimated this way, from a path
of length $N$, is denoted (when we want to make it explicit) $D_{+,N}^{i}(l)$.
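The recursion along a path can be sketched as follows. This is a minimal illustration, assuming discrete lengths, a distribution stored as a dictionary mapping length to probability, and an initial distribution concentrated on unit length; the function names and the argument layout are hypothetical.

```python
def update_distribution(d_j, p_ij, l_ij, d_init):
    """Expected distribution of x_i given its successor x_j:
    P_ij * D_j(l - l_ij) + (1 - P_ij) * D*(l)."""
    d_i = {}
    for length, prob in d_j.items():
        # Connected case: the extension of x_j is shifted by l_ij.
        d_i[length + l_ij] = d_i.get(length + l_ij, 0.0) + p_ij * prob
    for length, prob in d_init.items():
        # Disconnected case: fall back on the initial distribution.
        d_i[length] = d_i.get(length, 0.0) + (1.0 - p_ij) * prob
    return d_i


def path_distribution(cues, gaps, d_init):
    """Distribution at x_1 for a path x_1, ..., x_N, where cues[k] and gaps[k]
    are the grouping cue and the distance between x_(k+1) and x_(k+2)."""
    d = dict(d_init)  # the last point x_N starts from D*
    for p, l in zip(reversed(cues), reversed(gaps)):
        d = update_distribution(d, p, l, d_init)
    return d


def expected_length(d):
    return sum(length * prob for length, prob in d.items())


d1 = path_distribution(cues=[0.5, 0.5], gaps=[1, 1], d_init={1: 1.0})
```

For this three-point path the resulting distribution puts weight 0.5 on length 1 and 0.25 on each of lengths 2 and 3, so weakly connected links keep most of the mass on short lengths.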

2.3 Probabilistic Saliency 

Let $Q[D_+^i(l)]$ be a (scalar) quality measure computable from the length
distribution, and quantifying, in some way, the desired property of a long curve.
Typical measures may be the average length or other moments. This measure
serves as a one-sided saliency, and we shall look for feature points maximizing
it and for curves containing such points. Note that every feature point is
associated with two one-sided saliencies, corresponding to the two directions. Some
possible choices for the saliency are

Maximum one-sided expected length - A straightforward saliency measure
is the expected value of the extension length random variable, which is
denoted expected length and is easily calculated from its distribution.

Maximum two-sided expected length - Unless the curve is closed and
very tightly connected, maximizing the expected length in the two directions
is done independently for the two sides. Then, the sum of these one-sided
saliencies at a point is just the expected length of the curve on which the
point lies.

Maximum confidence one-sided curve - Some common object recognition
processes, which rely on curve invariants, need some continuous curve
from the object. In such a scenario, some reasonably long curve associated
with high reliability is preferred over a longer curve with lower reliability.
Here, the preferred curve is characterized by a distribution concentrated
around one value, in contrast to an uncertain estimate, characterized by a
closer to uniform distribution.

In the rest of this paper, the one-sided expected length saliency is usually 
used, although one example, demonstrating the advantages of the confidence 
emphasizing approaches is considered in the experiments. The expected length 
is the measure corresponding to SU saliency and its interpretation is simple and 
clear. We shall also see that it has algorithmic advantages. 

For feature points on closed curves, the meaning of the saliency as expected
length is distorted, because parts of the curve are counted twice
or more (after a sufficient number of iterations). The increase of the saliency
of closed curves is often considered desirable because closed curves usually have
higher significance than their open counterparts with the same length [SU88].
Calculating the expected length for closed curves can be done using the technique
described in [AH98] and shall not be repeated here.



3 The Probabilistic Saliency Optimization Process 



The aim of the optimization process is, for every feature point, to find a path, 
starting at this point and maximizing the saliency of that point (calculated 
relative to this path) . 

(We should mention here that the proposed method is similar, in principle,
to that proposed by Shaashua and Ullman (see [SU88, AH98]), and is presented
here only because some details differ (due to the use of distributions) and for
completeness. We tried to use similar notations when possible. The calculation
of saliency in the sense of [SU88], for a sparse set of feature points (i.e. without
virtual feature points), was considered also in [AM98].)

Calculating this optimum is easy for short paths (e.g. $N = 1, 2$) but is
generally exponential in $N$. Fortunately, it may be calculated by a simple iterative
process using dynamic programming if the quality criterion (or saliency) is
extensible [SU88]. That is, if the saliency associated with the best (length $N$) path
starting from $x_i$ satisfies

$$Q[D_{+,N}^{i}(l)] = \max_j F\left(q_{ij},\, D_{+,N-1}^{j}(l)\right)$$

where $q_{ij}$ is a quantity calculated from the feature points $x_i$ and $x_j$,
$D_{+,N-1}^{j}(l)$ is the distribution associated with the best (length $N - 1$) path,
associated with the highest saliency, starting from $x_j$, and the maximization is done over
all neighbors $x_j$ of $x_i$. Note that this condition is a bit more general than that
suggested in [SU88], as the new saliency calculation may use the distribution and
not only a function of it. In fact, all information about the best path may be used
as well, as the more general condition is that the optimal solution contains within
it optimal solutions to subproblem instances [CLR90]. Note that the expected
length is an extensible quality criterion.

3.1 The Iterative Process

Preprocessing:

A neighborhood is specified for every feature point.

At the $k$-th iteration ($k = 1, 2, 3, \ldots$)

For every feature point $x_i$

1. For all neighbors $x_j$, $j = 1, 2, 3, \ldots$, of $x_i$

a) calculate the grouping cue $P_{ij}$.

b) update the length distribution using eq. (2), and calculate the quality
measure of the updated distribution.

2. Choose the neighbor $x_j$ maximizing the quality measure and update the
length distribution of $x_i$ to the one obtained from this neighbor.




Fig. 1. The one-sided length distribution at the point $C_3$ (in the top left illustration),
plotted for 1, 2, 3, 4 and 10 iterations. Note that for such a smooth curve (a straight line
segment), the distribution quickly develops a significant weight for the large values.



The procedure starts when all feature points are associated with the basic
distribution $D^*(l)$. For saliencies preferring long curves, the process behaves as
follows: At the first stage, every feature point $x_i$ chooses the best perceptually
connected neighbor $x_j$, so that $P_{ij}$ is maximal for it, “improves” its distribution,
and increases its saliency. At the next iterations, the preferred neighbor is chosen
not only by its perceptual affinity but also by its own saliency, as generated in
the previous iterations. See Figure 1, illustrating the development of the length
distribution associated with a particular point, and Figure 2, which describes some
(roughly) stable distributions obtained after many iterations.




Fig. 2. The left graph (a) describes some distributions corresponding to the different
points $C_1, \ldots, C_r$ (in the previous Figure) after 80 iterations. Note that points which
are close to the end ($C_1$ is the closest) cannot develop large values, and correspond
to the distributions with peaks on small $l$ values. The point A gets support from
a smooth curve and is associated with a distribution having significant weight in the
high values (b). The point B is weakly connected to A, and therefore its distribution
is an average of the initial distribution, focusing on low values, and that of A, which
makes it roughly bimodal (c).



Apart from building the length distributions, the process also specifies, for 
every feature point, the next feature point on its extension. Thus, starting from 
salient points, the iterative process also finds the long, well-connected curves 
which contributed to and supported this high saliency. 
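
The iterative process described above can be sketched in a few lines. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the update is taken to be D_i(l) = (1 - P_ij) D*(l) + P_ij D_j(l - l_ij), the quality measure is the expected length, lengths are integer bins, and the names `shift`, `propagate`, `P`, `L` and `base` are hypothetical.

```python
import numpy as np

def shift(dist, k):
    """Distribution of (l + k): move every weight k bins to the right."""
    n = len(dist)
    k = min(k, n)
    out = np.zeros(n)
    out[k:] = dist[:n - k]
    out[-1] += dist[n - k:].sum()  # overflow is clamped into the last bin (assumption)
    return out

def propagate(P, L, base, n_iter):
    """P[i, j]: grouping probability, L[i, j]: integer step length,
    base: initial length distribution D*. Returns distributions and successors."""
    n = P.shape[0]
    bins = np.arange(len(base))
    D = np.tile(base, (n, 1)).astype(float)
    nxt = np.full(n, -1)                      # next feature point on the extension
    for _ in range(n_iter):
        new_D = D.copy()                      # synchronous update of all points
        for i in range(n):
            best_q = -np.inf
            for j in np.flatnonzero(P[i] > 0):
                cand = (1 - P[i, j]) * base + P[i, j] * shift(D[j], int(L[i, j]))
                q = float(cand @ bins)        # expected-length quality measure
                if q > best_q:
                    best_q, nxt[i] = q, j
                    new_D[i] = cand
        D = new_D
    return D, nxt

# Toy example: five points on a straight chain, each linked to the next one.
n = 5
P = np.zeros((n, n))
L = np.ones((n, n), dtype=int)
for i in range(n - 1):
    P[i, i + 1] = 0.9
base = np.zeros(40)
base[:3] = 1.0 / 3.0                          # equal weights on lengths 0, 1 and 2
D, nxt = propagate(P, L, base, n_iter=10)
expected = D @ np.arange(40)
```

Following `nxt` from a salient point traces the curve that supports its saliency; in the toy chain the first point is the most salient and its successors are the remaining points in order.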



3.2 Shortcuts 

Apparently, one deficiency of the proposed saliency is the need to update a 
length distribution for every feature point, which is costly in time and space. 
To alleviate this problem we suggest storing and updating only the statistics 
required to calculate the preferred saliency. For example, to calculate the 
expected length quality measure, let E*[l] be the expected length associated 
with the distribution D*(l). Then the distribution update rule is replaced by 
the following expected length update rule: 

E[l]_i^{(n+1)} = P_{ij} (l_{ij} + E[l]_j^{(n)}) + (1 - P_{ij}) E^*[l]

Other statistics (e.g. variance) may be propagated similarly, and there is usually 
no need to propagate the entire distribution. 
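
As a rough illustration of this shortcut, the sketch below propagates a single number per feature point instead of a whole distribution. It assumes the expected length rule E_i = P_ij (l_ij + E_j) + (1 - P_ij) E*[l] with the neighbor chosen to maximize the result; the function and argument names are hypothetical.

```python
def propagate_expected_length(P, L, e_star, n_iter):
    """P[i][j]: grouping probability, L[i][j]: step length l_ij,
    e_star: expected length E*[l] of the initial distribution."""
    n = len(P)
    E = [e_star] * n
    for _ in range(n_iter):
        new_E = list(E)                       # synchronous update
        for i in range(n):
            candidates = [P[i][j] * (L[i][j] + E[j]) + (1 - P[i][j]) * e_star
                          for j in range(n) if P[i][j] > 0]
            if candidates:                    # points without neighbors keep E*
                new_E[i] = max(candidates)
        E = new_E
    return E

# Five points on a chain, each linked to the next with probability 0.9.
n = 5
P = [[0.9 if j == i + 1 else 0.0 for j in range(n)] for i in range(n)]
L = [[1] * n for _ in range(n)]
E = propagate_expected_length(P, L, e_star=1.0, n_iter=10)
```

The storage per feature point drops from one histogram to one scalar, at the price of losing any quality measure that needs more than the propagated statistic.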

3.3 Optimality and Convergence 

By the same arguments made in standard dynamic programming and in [SU88], 
after N iterations, the length distribution of the i-th feature is associated with 
the maximal saliency. The maximum is over all possible curves of length N 
starting at the i-th feature point. This optimization happens for all feature points 
simultaneously. One (or more) of them will also achieve the global maximum of the 
saliency measure. Therefore, the process also finds the maximal quality curve, as 
measured by the saliency of its endpoint. 

After N iterations, all the open paths of length N or less which start at x_i 
make their maximal contribution. If N is set as the number of feature points in 
the image, then the process should converge after N iterations. The exception 
is of course closed curves, which are equivalent to infinite chains. We show now 
that even for closed curves the length distribution converges. The proof 
follows some principles from [AB98]. 

Consider, for example, a feature point on a closed path of length N_c. Let 
this point be the i-th point and let the direction be such that this i-th point 
updates its distribution based on the (i+1)-th point. Until the N_c-th iteration, 
the closure does not affect the distribution associated with the feature point. At 
the N_c-th iteration, the length distribution of the i-th point may be written as 

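D_i^{(N_c)}(l) = (1 - P_{i,i+1}) D^*(l) + P_{i,i+1} D_{i+1}^{(N_c-1)}(l - l_{i,i+1})

  = (1 - P_{i,i+1}) D^*(l) + P_{i,i+1} (1 - P_{i+1,i+2}) D^*(l - l_{i,i+1}) + \ldots

  = (1 - \prod_{j=0}^{N_c-1} P_{i+j,i+j+1}) \bar{D}^*(l) + \prod_{j=0}^{N_c-1} P_{i+j,i+j+1} D_i^{(0)}(l - \sum_{j=0}^{N_c-1} l_{i+j,i+j+1})

  = (1 - \alpha) \bar{D}^*(l) + \alpha D_i^{(0)}(l - L)

  = (1 - \alpha) \bar{D}^*(l) + \alpha D^*(l - L)                                  (3)

where \bar{D}^*(l) is an average distribution of D^*(l), D^*(l - l_{i,i+1}), 
D^*(l - l_{i,i+1} - l_{i+1,i+2}), ... (with non-equal coefficients), 
L = \sum_{j=0}^{N_c-1} l_{i+j,i+j+1}, and \alpha = \prod_{j=0}^{N_c-1} P_{i+j,i+j+1}. 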

Note that while the distribution of the i-th feature point is no longer the initial 
distribution, this update is not yet reflected in the way it supports itself through 
the closed curve. From the next iterations on, however, the change of the i-th 
feature point histogram will be reflected in this support, and after N_c additional 
iterations the histogram will change to 

D_i^{(2N_c)}(l) = (1 - \alpha) \bar{D}^*(l) + \alpha D_i^{(N_c)}(l - L)

  = (1 - \alpha) \bar{D}^*(l) + \alpha [ (1 - \alpha) \bar{D}^*(l - L) + \alpha D^*(l - 2L) ]      (4)

After K · N_c iterations, 

D_i^{(K N_c)}(l) = (1 - \alpha) \sum_{k=0}^{K-1} \alpha^k \bar{D}^*(l - kL) + \alpha^K D^*(l - KL)      (5)



Consider now any finite moment of order m associated with the length 
distribution. Note that D^*(l - kL) (and \bar{D}^*(l - kL)) has zero weight on any length 
l higher than kL. Therefore, after the K N_c-th iteration, this moment, denoted 
M^{(m)}_{K N_c}, is bounded: 

M^{(m)}_{K N_c} \le (1 - \alpha) \sum_{k=0}^{K-1} \alpha^k (kL)^m + \alpha^K (KL)^m

  = (1 - \alpha) L^m \sum_{k=0}^{K-1} \alpha^k k^m + \alpha^K (KL)^m                (6)

For any reasonable cue, \alpha = \prod_{j=0}^{N_c-1} P_{i+j,i+j+1} is strictly smaller than 
one and the bounds on the moments converge. The moments themselves are increasing 
with K, and therefore converge, and hence the distribution converges. 
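
A small numerical check can make the convergence argument concrete for the first moment. The sketch below is an illustration under simplifying assumptions that are not in the text: all links on the closed curve share the same probability p and the same step length, and only the expected length is propagated, using the rule E_i = p (step + E_{i+1}) + (1 - p) E*[l]. Under these assumptions the fixed point is E = E*[l] + p · step / (1 - p).

```python
def ring_expected_length(n_points, p, step, e_star, n_iter):
    """Expected length of point 0 on a closed curve, recorded after every iteration."""
    E = [e_star] * n_points
    history = [E[0]]
    for _ in range(n_iter):
        # Every point is supported by its successor; indices wrap around the ring.
        E = [p * (step + E[(i + 1) % n_points]) + (1 - p) * e_star
             for i in range(n_points)]
        history.append(E[0])
    return history

history = ring_expected_length(n_points=8, p=0.9, step=1.0, e_star=1.0, n_iter=200)
fixed_point = 1.0 + 0.9 * 1.0 / (1.0 - 0.9)   # E* + p * step / (1 - p)
```

The recorded values increase with the iteration count and approach the fixed point, which stays finite whenever p is strictly below one, even though the closed curve acts as an infinite chain.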



3.4 Relation to the Original SU Saliency 

The original saliency measures, proposed in [SU88], were meant to mimic the human 
visual system (HVS) behavior and to model the priority it gives to long smooth 
curves, even when they are fragmented. Our approach, on the other hand, is 
based on a statistical characterization of grouping cues, which is believed to be 
available. It is well known that the HVS is very successful in grouping tasks; 
therefore, the statistics of grouping cues must have been learned and incorporated 
into its grouping mechanisms. Thus, it is expected that our method will also 
give results which are compatible with the HVS preferences. For cues based on 
co-circularity, which is the principle used in [SU88], the results of both methods 
are expected to be similar. 

We shall show now that, in the context of a curvature/distance based cue, the 
SU algorithm corresponds to an instance of our algorithm. The saliency of the 
i-th feature, specified in [SU88], is updated by the local rule 

E_i^{(n+1)} = \sigma_i + \rho_i \max_j ( E_j^{(n)} f_{ij} )

where the maximum is taken over all the features in the neighborhood of the 
i-th feature, and 

- E_i^{(n)} is the saliency of the i-th feature after the n-th iteration, 

- σ_i is a “local saliency” which is set to a positive value (e.g. 1) for every 
  real feature, 

- ρ_i is a penalty for gaps which is set to one in features (no gap) and to a 
  lower value when the feature is virtual, and, finally, 

- f_ij is a “coupling constant” which decreases with the local curvature. 

In the framework of [SU88], features could be “real” (where we have, say, an 
edge point), or “virtual”, where there is no local image based evidence for an 
edge. This choice allows one to hypothesize an image-independent and parallel 
local architecture, which is a plausible model for a perceptual process. A virtual 
feature does not add to saliency and therefore is associated with a null σ_i. It 
should also attenuate the currently existing saliency and is therefore associated 
with a ρ_i parameter lower than one. 

In our framework, all features are real. For them, the co-circularity may be 
interpreted as a measure for the grouping probability: by the general assumption 
that smooth curves are likely, a low curvature implies that a connection is 
more probable than with a high curvature. Thus, for real feature points, the SU 
update formula may be interpreted as 

E_i^{(n+1)} = 1 + \max_j ( E_j^{(n)} P_{ij} )                                        (7)

The ability to continue the curve over gaps is interpreted as follows: suppose 
that the three feature points x_i, x_j, x_k are consecutive along the curve, are 
chosen as such by the SU algorithm, and let x_j be a virtual feature point. Then 
the SU saliency of x_i is (roughly) 

E_i^{(n+1)} = 1 + f_{ij} E_j^{(n)} = 1 + f_{ij} f_{jk} \rho_j E_k^{(n-1)} = 1 + P_{ik} E_k^{(n-1)}

Thus, the effect of a missing point may be replaced by a lower 
probability P_ik = f_ij f_jk ρ_j. The probability of the feature point x_i to be part 
of the curve c(x_k) on which x_k lies is indeed lower when there is a gap between 
x_i and x_k. Moreover, the process of calculating a cue between two distant 
points may be considered as an explicit search for a path between them which 
minimizes a cost function. 
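
The gap argument can be checked numerically. The sketch below is illustrative only: it assumes the SU local rule E_i = σ_i + ρ_i max_j (E_j f_ij), takes the attenuation to be that of the virtual point x_j, and uses made-up values for the coupling constants; `su_update` is a hypothetical helper name.

```python
def su_update(sigma_i, rho_i, neighbors):
    """One SU step for a feature; neighbors is a list of (E_j, f_ij) pairs."""
    return sigma_i + rho_i * max((e * f for e, f in neighbors), default=0.0)

E_k = 5.0                            # saliency already accumulated at x_k (made up)
f_ij, f_jk, rho_j = 0.9, 0.8, 0.7    # coupling constants and gap penalty (made up)

# Two SU steps through the virtual point x_j (sigma_j = 0, rho_j < 1) ...
E_j = su_update(0.0, rho_j, [(E_k, f_jk)])
E_i = su_update(1.0, 1.0, [(E_j, f_ij)])

# ... match a single step over a direct link with a lower probability.
P_ik = f_ij * f_jk * rho_j
E_i_direct = 1.0 + P_ik * E_k
```

Both routes give the same saliency for x_i, which is the sense in which a virtual feature can be folded into a single, weaker grouping probability between two real features.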

Recall now (from Section 3.2) that the expected length propagates as 

E[l]_i^{(n+1)} = (1 - P_{ij}) E^*[l] + P_{ij} (l_{ij} + E[l]_j^{(n)})

  = E^*[l] + P_{ij} (l_{ij} - E^*[l] + E[l]_j^{(n)})                                  (8)

which, for an inter-pixel distance l_ij equal to the expected length of one edgel 
E*[l], and both equal to 1, yields 

E[l]_i^{(n+1)} = 1 + P_{ij} E[l]_j^{(n)}                                             (9)

Therefore we conclude that the co-circularity and the gap attenuation, used in SU 
saliency, may be interpreted as measures of the grouping probability used here, 
and that the overall saliency maximized there, is, according to this interpretation, 
the expected length. 

There are also other differences, but they are technical, and result from our 
use of directional feature points, implying that we can work with the actual 
features, and not with the arcs between the features as done in [SU88]. 



4 Implementation and Experiments 



In contrast with [SU88, AB98], we considered only real (non-virtual) feature 
points. They were oriented using the gradient direction. The positive (negative) 
extension neighbors of every feature point were all (real) neighboring feature 
points such that the vector x_i x_j makes an angle in [π/6, 5π/6] ([-5π/6, -π/6]) 
with the gradient. The neighborhood was usually a disk of radius 10 pixels. The 
initial length distribution was set to have equal weights on the values 0, 1 and 2. 



Traditional co-circularity cue. First we considered the classical cue, using 
curvature (or weighted angle differences). Following previous work, we set the 
cue as 

P_{ij} = \exp\{ -\|x_{ij}\|^2 / 50 \} \cdot \exp\{ -\tan(GradAngleDiff / 2) \},

where GradAngleDiff is just the difference between the two gradient angles 
at the two points. Note that, as the distance between a feature point and 
its neighbors is no longer constant, we added a preference for short distances. 
Interestingly, this dependency needs to reduce the cue faster than exp{-||x_ij||}, 
because otherwise the process always prefers the far neighbors. (Going to that 
neighbor through another, closer neighbor gives a lower expected length, which 
follows directly from the update rules.) 
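
A minimal sketch of such a cue follows. It assumes the parametric form reads P_ij = exp{-||x_ij||^2 / 50} · exp{-tan(GradAngleDiff / 2)}, with both exponents negative so that the cue stays in [0, 1]; taking the absolute, wrapped angle difference is a further assumption, and the function name and arguments are hypothetical.

```python
import math

def cocircularity_cue(xi, xj, grad_angle_i, grad_angle_j, scale=50.0):
    """Grouping cue between two oriented feature points (angles in radians)."""
    dist_sq = (xj[0] - xi[0]) ** 2 + (xj[1] - xi[1]) ** 2
    # Wrap the gradient-angle difference into [0, pi] so that tan(diff / 2) >= 0.
    diff = abs((grad_angle_i - grad_angle_j + math.pi) % (2 * math.pi) - math.pi)
    return math.exp(-dist_sq / scale) * math.exp(-math.tan(diff / 2))

near_aligned = cocircularity_cue((0, 0), (1, 0), 0.0, 0.0)
far_aligned = cocircularity_cue((0, 0), (5, 0), 0.0, 0.0)
near_turned = cocircularity_cue((0, 0), (1, 0), 0.0, math.pi / 3)
```

The squared distance in the first factor is what makes the cue fall off faster than a plain exponential of the distance, so that a far neighbor does not automatically beat a closer one.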

We view this experimental work as an intermediate stage, because the actual 
probabilities are not those determined by this parametric form. Our current work 
focuses on measuring these cues empirically. 

Here (Figures 3 and 4) are two examples of the implementation. They include 
the original image (synthetic and real), the edge points detected with the standard 
DRF (Khoros) operator, the two one-sided saliencies and their sum, and the 
thresholded saliency. Note that the saliency image has a concrete meaning: it is 
the expected length of the curve on which the point lies. In the one-sided case, for 
example, if one starts from a point associated with a saliency of 38 (a typical value 
for the strong curves on, say, the lizard back), one can expect to find about 38 
neighbors on the curve in one of the directions. 




Fig. 3. A typical saliency calculation with an angle cue, done on an image heavily 
corrupted by noise: (starting from upper left, clockwise) the original image, edges, 
positive and negative saliencies, sum of saliencies, thresholded saliency. 



Saliency with a Grey Level Cue. Next we took the same saliency process 
and just changed the cue, which now measures the similarity in grey levels and 
not the smoothness of the curve. Specifically, we set 

P_{ij} = \exp\{ -\|x_{ij}\|^2 / 50 \} \cdot \exp\{ -50 \, GreyLevelDiff^2 / ( GradSize(i) \, GradSize(j) ) \}

Here GreyLevelDiff is the difference in grey levels between the two feature 
points, and GradSize(i) is the gradient size at x_i. See Figure 5. Note that most 
unwanted additions to the thresholded saliency image are in inner points where 
the grey level is similar and random high gradients exist. Note also that the 
saliency value has the same meaning: the expected length of the curve (either to one 
side or to both). Actually, the results were better than we expected and in a sense 
outperformed the use of the angle-based cue. We intend to investigate this issue 
further, and with real images as well. To conclude, this experiment demonstrates 
that a saliency process which is similar, in principle, to that proposed in [SU88] 
can work also with other sources of information. 




Fig. 4. Saliency calculation for a real, complex, image: (From upper left, clockwise) 
original image, edges, sum of positive and negative saliencies, two length distributions 
associated with a stone point (dark) and the lizard back (lighter), and thresholded 
saliency. 



A Saliency measure emphasizing confidence. It may happen that a relatively 
weakly connected sequence of edgels will yield a substantial expected length 
(or SU) saliency. Indeed this was the case, for example, in the real “lizard” image, 
where many points on the texture were associated with large saliency. Thus, a 
quality measure emphasizing connectedness over length may be 
preferred. One such measure is the “expected square root length”, specified 
as E[√l], which prefers shorter curves associated with higher confidence: 
consider, for example, two length distributions, one giving full weight to the 
length l = 10 and another sharing the weight between l = 0 and l = 20. While 
the expected length associated with the two distributions is identical, the expected 
square root length clearly prefers the more “concentrated” distribution, where 
full confidence is given to the l = 10 value, and gives it a saliency value of √10, 
which is larger by a factor of √2 than the saliency associated with the other 
distribution. Indeed we found that such saliency may have advantages when 
working on real images (see Figure 6). 

Note however that such saliency has one severe theoretical deficiency: it is 
not extensible, and thus global maximization is not guaranteed. 
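
The two-distribution example can be reproduced directly. The short sketch below represents a length distribution as a mapping from length to weight (a representation chosen here purely for illustration) and compares the expected length with the expected square root length.

```python
import math

def expected_length(dist):
    """E[l] for a distribution given as {length: weight}."""
    return sum(w * l for l, w in dist.items())

def expected_sqrt_length(dist):
    """E[sqrt(l)] for a distribution given as {length: weight}."""
    return sum(w * math.sqrt(l) for l, w in dist.items())

concentrated = {10: 1.0}          # full weight on l = 10
spread = {0: 0.5, 20: 0.5}        # weight shared between l = 0 and l = 20

ratio = expected_sqrt_length(concentrated) / expected_sqrt_length(spread)
```

Both distributions have expected length 10, while the square root measure scores the concentrated one at √10 and the spread one at √20 / 2, so the ratio is √2.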




Fig. 5. A typical saliency calculation with a grey level cue, done on an image heavily 
corrupted by noise: (starting from upper left, clockwise) the original image, edges, 
positive and negative saliencies, sum of saliencies, thresholded saliency. 



5 Discussion 

This note presented a framework and an algorithm for calculating a well defined 
saliency measure which is based on estimating the length distribution and the 
expected length of curves. The work was motivated by the SU saliency [SU88], 
which, in our opinion, was built on good principles but lacked interpretation, 
at least for computer vision practitioners. One result of the proposed work is 
that, when interpreting high curvature and gaps as factors which decrease the 
probability to connect, the SU saliency calculates the expected length of 
the curve on which every point lies. This is of course in agreement with [AB98], 
where the saliency of a straight line of length l and no gaps is found to be l. 

The work is now in progress and we are exploring many interesting issues 
related to the proposed saliency mechanism. One interesting question is whether 
we can make the saliency invariant to scale (at least in the sense that the ratio 
between the saliencies of two different curves stays the same over scale), and thus 
solve one of the problems raised in [AB98]. This is possible in principle because 
we are no longer limited to the curvature cue but can design other cues as well. 
An even more interesting question is whether there is a useful extensible 
quality function of the distribution which is different from the expected value 
(or weighted expected value). The variance, for example, is not such a function, 
because it is not necessary that the path of length N associated with, say, the 
least variance contains a path of length N - 1 associated with the lowest variance 
as well. 




Fig. 6. Comparison between the expected value saliency (left) and the square root 
expected saliency (right). Both saliencies were thresholded so that only the points 
associated with saliency in the top 10% are marked. The square root measure 
preserves more of the figure un-fragmented and contains less background texture 
(although the differences are not that large). This was also the case for other 
thresholds. 



The claimed added advantage of higher reliability is not fully proved yet 
in this paper. Our current goal is to develop methods for characterizing the 
probability Pij empirically and for constructing cues which are associated with 
a higher reliability than simply measuring the curvature. We expect to gain in 
the overall reliability when such cues are constructed. 

The interpretation of cues as probabilities was considered in [WJ96], where 
the stochastic motion of a particle was used to model completion fields, and it 
elicits a saliency process as well (as observed in [WT98]). The saliency induced 
by this process is different from that suggested in [SU88], mainly because it is 
not associated with a single “best” curve but with some average of all curves 
in the image. Interestingly, a modified form of the proposed saliency may 
be created by updating the length distribution not according to the best curve 
but according to the average of all curves, with weights which are just the 
corresponding probabilities. This way we get an alternative estimate of the length 
distribution (and the expected length). Which one is better? As we see it, the 
saliency method that we proposed here (and that of [SU88]) is a maximum 
likelihood approach to saliency and length estimation, because it calculates the 
saliency relative to the best parameter. (This “parameter” is a path in this case.) 
The second approach is essentially Bayesian and allows contributions from 
many alternatives. Note that both methods can be used to calculate the expected 
length estimate. We actually expect the second, Bayesian, method to give 
more visually pleasing saliency plots. Observe, however, that it does not provide 
an estimate of the best path with it. 

6 Acknowledgments 

This research was supported by the fund for the promotion of research at the 
Technion. 



References 



[AB98]    T.D. Alter and R. Basri. Extracting salient curves from images: An analysis 
          of the saliency network. IJCV, 27(1):51-69, March 1998. 

[AL98]    A. Amir and M. Lindenbaum. Ground from figure discrimination. In 
          CVPR98, pages 521-527, 1998. 

[AM98]    Laurent Alquier and Philippe Montesinos. Representation of linear 
          structures using perceptual grouping. Presented at the 1st Workshop on 
          Perceptual Organization in Computer Vision, 1998. 

[CLR90]   T.H. Cormen, C.E. Leiserson, and R.L. Rivest. Introduction to Algorithms. 
          MIT Press, 1990. 

[GM93]    G. Guy and G.G. Medioni. Inferring global perceptual contours from local 
          features. In CVPR93, pages 787-787, 1993. 

[HH93]    Laurent Herault and Radu Horaud. Figure-ground discrimination: A 
          combinatorial optimization approach. PAMI, 15(9):899-914, Sep 1993. 

[HvdH93]  Friedrich Heitger and Rudiger von der Heydt. A computational model of 
          neural contour processing: Figure-ground segregation and illusory contours. 
          In ICCV-93, Berlin, pages 32-40, 1993. 

[Low85]   David G. Lowe. Perceptual Organization and Visual Recognition. Kluwer 
          Academic Publishers, 1985. 

[Mar82]   D. Marr. Vision: A computational investigation into the human 
          representation and processing of visual information. W.H. Freeman, 1982. 

[SB98]    S. Sarkar and K.L. Boyer. Quantitative measures of change based on 
          feature organization: Eigenvalues and eigenvectors. CVIU, 71(1):110-136, 
          July 1998. 

[SU88]    Amnon Sha’ashua and Shimon Ullman. Structural saliency: The detection 
          of globally salient structures using locally connected network. In ICCV-88, 
          pages 321-327, 1988. 

[Wer50]   Max Wertheimer. Laws of organization in perceptual forms. In Willis D. 
          Ellis, editor, A Source Book of Gestalt Psychology, pages 71-88, 1950. 

[WJ96]    L.R. Williams and D.W. Jacobs. Local parallel computation of stochastic 
          completion fields. In CVPR96, pages 161-168, 1996. 

[WT98]    L. Williams and K. Thornber. A comparison of measures for detecting 
          natural shapes in cluttered backgrounds. In ECCV98, 1998. 



Layer Extraction with a Bayesian Model of 

Shapes 



P.H.S. Torr¹, A.R. Dick², and R. Cipolla² 

¹ Microsoft Research, 1 Guildhall St, Cambridge CB2 3NH, UK 
philtorr@microsoft.com 

² Department of Engineering, University of Cambridge, Cambridge CB2 1PZ, UK 
{ard28,cipolla}@eng.cam.ac.uk 



Abstract. This paper describes an automatic 3D surface modelling sy- 
stem that extracts dense 3D surfaces from uncalibrated video sequences. 
In order to extract this 3D model the scene is represented as a collec- 
tion of layers and a new method for layer extraction is described. The 
new segmentation method differs from previous methods in that it uses 
a specific prior model for layer shape. A probabilistic hierarchical model 
of layer shape is constructed, which assigns a density function to the 
shape and spatial relationships between layers. This allows accurate and 
efficient algorithms to be used when finding the best segmentation. Here 
this framework is applied to architectural scenes, in which layers com- 
monly correspond to windows or doors and hence belong to a tightly 
constrained family of shapes. 



Keywords: Structure from motion. Grouping and segmentation. 

1 Introduction 

The aim of this work is to obtain dense 3D structure and texture maps from an 
image sequence, the camera matrices (calibration and location) having been 
recovered using previously developed methods. The computed structure 
can then be used as the basis for building 3D graphical models. This represen- 
tation can be used as a basis for compression, new view rendering, and video 
editing. A typical example sequence is shown in Figure Q and the computed 
model in Figured 

Although extracting scene structure using stereo has been actively researched, 
the accurate recovery of the depth for each pixel remains only partially 
solved. For instance, one approach to the dense stereo problem is the voxel 
based approach, in which the scene volume is first discretized into voxels, and 
then a space carving scheme is applied to find the voxels that lie on the surfaces 
of the objects in the scene. The disadvantage of the voxel carving method is 
that the surfaces produced from homogeneous regions are “fattened” out to a 
shape known as the Photo Hull. Rather than generate voxels in 3D, some 
algorithms operate in the image by testing different disparities for each pixel, e.g. 
Koch et al. The problem with these approaches is that they do not treat all 
the images equally and work well only for small baselines. 

Generally dense stereo algorithms work well in highly textured regions, but 
perform poorly around occlusion boundaries and in untextured regions. This is 
because there is simply not enough information in these untextured regions to 
recover the shape. In this paper we propose a general framework for overcoming 
this by the utilization of prior knowledge. 

A vehicle for encoding this prior knowledge is the decomposition of the image 
into layers. Each layer corresponds to a surface in the scene, 
hence the decomposition of the scene into layers acknowledges the conditional 
dependence of the depths for adjacent pixels in an image. Detecting the different 
surfaces (layers) within the scene offers a compact and physically likely 
representation for the image sequence. The main problem is that such a decomposition 
is difficult to achieve in general. This is because the parametrization of the layer 
itself is problematic. For each layer the parametrization is composed of three 
parts: (a) the parametric form of the 3D surface giving rise to the layer, (b) its 
spatial extent within the image and (c) its texture map. Generally it is easy to 
construct the former: in some methods it is assumed that the surfaces are planar, 
in others the surfaces are encoded by a plane together with a per-pixel parallax, 
and in others only smoothness of depth is assumed. The latter two however are more 
difficult to parametrize. One approach is to ignore the spatial cohesion 
altogether and simply model the whole image as a mixture model of the layers. 
Whilst this simplifies the problem of estimating the layers, affording the use 
of iterative algorithms like EM, it is not a realistic model of layer generation: e.g. 
a homogeneous region which contains little depth or motion information could 
be broken up in any way and assigned to different layers with no increase in the 
mixture model’s likelihood. 

A now classical method for modelling the spatial dependence of layer 
memberships of adjacent pixels is by use of Markov Random Fields (MRFs) [2]. 
There are several disadvantages with this approach: first is that using an MRF 
model leads to very difficult optimization problems that are notoriously slow to 
converge. Second, sampling from an MRF distribution does not produce things 
that look like images of the real world, which might lead one to think that using 
this as a prior is a bad idea. Third, the MRF is pixel based which can lead 
to artefacts. The MRF only implicitly defines the prior probability distribution 
in that the normalization factor cannot be readily computed. What would be 
preferable would be an explicit prior for the segmentation, which would allow 
more direct minimization of the error function, for instance by gradient descent. 

Within this paper a prior for the shape of the layers is constructed and 
illustrated for architectural scenes. Architectural scenes are particularly amenable 
to the construction of priors, as layers will typically correspond to such things as 
windows or doors, for which an informative prior distribution can be 
constructed (e.g. they are often planar with a regular outline). Although architectural 
scenes are chosen to illustrate the basic principles, the method proposed is 
representative of a general approach to segmentation. Taking inspiration from 
earlier work, rather than using an implicit model for the prior probability of a 
segmentation, an explicit model is defined and used. A solution to the final 
problem (c), that of finding the texture map, is also found by considering the 
texture as a set of hidden variables. 

The layout of the paper is as follows. The parameters used to represent the 
shape and texture of a scene are defined in Section 2. A posterior probability 
measure is also introduced there for estimating the optimal parameter values for 
a scene from a set of images and prior information. As shape is now represented 
parametrically, layer extraction becomes a problem of model selection, i.e. 
determining the number and type of the parameters required to model the scene. In 
Section 3 a method is developed for choosing automatically which model is most 
appropriate for the current scene, based on goodness of fit to the images and the 
idea of model simplicity. Section 4 then deals with the details of implementing 
this method. In Section 5 it is demonstrated that this technique can decide which 
individual shape model is appropriate for each layer, which overall model best 
fits the collection of layers in the scene, and how many layers are present in the 
scene, given a coarse initialisation. Concluding remarks are given in Section 6. 

2 Problem Formulation 

A scene is modelled as a collection of layers. A scene model has a set of 
parameters represented by a vector θ, which can be decomposed into shape parameters 
θ_S and texture parameters θ_T such that θ = θ_S ∪ θ_T. Each layer is defined 
as a deformable template in three-space; the shape parameters θ_S comprise the 
location and orientation of each template together with the boundary (a variable 
number of parameters for each layer depending on which model M is selected). A 
grid is defined on the bounded surface of each layer, and each point on this grid is 
assigned an intensity value, forming a texture map on the 3D layer. The intensity 
at each grid point is a variable of θ_T. The projection matrix to a given image, a 
noise process, and θ provide a complete generative model for that image. Each 
point in the model can be projected into the image and the projected intensity 
compared with that observed, from which the likelihood of the model can be 
computed. If priors are assigned to the parameters then the posterior probability 
can be computed. 

Within this paper a dominant plane is assumed to fill most of the scene (such 
as a wall of the building), with several offset objects (such as windows, doors and 
pillars). An example of such a scene is given in Figure 1. The background plane 
C_0 is modelled as the plane z = 0 with infinite extent (thus having no shape 
parameters). The other layers C_1 ... C_m are modelled as deformable templates 
as now described. 

2.1 The Shape Parameters 

At present there are four types of layer model M available, which allow the 
modelling of a wide variety of architectural scenes. These are M_1, a rectangle (6 
parameters), M_2, an arch (7 parameters), M_3, a rectangle with bevelled (sloped) 
edges (7 parameters), and M_4, an arch with bevelled edges (8 parameters). The 
8 parameter model M_4 has position coordinates (x, y), scale parameters a and 
b, orientation ω, an arch height c, depth from the background plane d and bevel 
width r. The arch in M_2 and M_4 is completely specified by c as it is modelled 
using a semi-ellipse. The other layer models are constrained versions of this 
model, as shown in Figure 2. 

Fig. 1. Three images of the type of scene considered. A gateway and two indentations 
are offset from the background plane of the wall. 




[Figure 2 panels: Frontal View; Overhead view (with the Background Plane marked)] 



Fig. 2. Top and cross-sectional views of the most general shape primitive. The other 
primitives are special cases of this one — the non-bevelled arch has r = 0, the bevelled 
rectangle has c = 0 and the non-bevelled rectangle has r = 0,c = 0. The coordinate 
axes shown for each view are translated versions of the 3D world coordinate system. 
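
The nesting of the four primitives can be written down as a small data structure. This is a sketch for illustration only, not the authors' representation: the class and method names are hypothetical, and the labels M1 to M4 simply follow the order in which the models are listed (rectangle, arch, bevelled rectangle, bevelled arch).

```python
from dataclasses import dataclass

@dataclass
class ShapePrimitive:
    x: float            # position
    y: float
    a: float            # horizontal scale
    b: float            # vertical scale
    omega: float        # orientation
    d: float            # depth from the background plane
    c: float = 0.0      # arch height (0 means no arch)
    r: float = 0.0      # bevel width (0 means no bevel)

    def model(self) -> str:
        """Name of the most constrained model that describes this primitive."""
        if self.c == 0 and self.r == 0:
            return "M1"     # rectangle
        if self.r == 0:
            return "M2"     # arch
        if self.c == 0:
            return "M3"     # bevelled rectangle
        return "M4"         # bevelled arch

    def n_parameters(self) -> int:
        """6 shared parameters plus one for each of c and r that is free."""
        return 6 + int(self.c != 0) + int(self.r != 0)

window = ShapePrimitive(x=1.0, y=2.0, a=0.5, b=0.8, omega=0.0, d=0.1)
doorway = ShapePrimitive(x=3.0, y=1.0, a=1.0, b=2.0, omega=0.0, d=0.3, c=0.4, r=0.05)
```

Treating the simpler models as the general primitive with c or r fixed at zero keeps a single code path for projection and fitting, while the parameter count still distinguishes the models for model selection.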



Layers in architectural scenes are highly constrained not only in their 
individual shape, but also in their spatial relationship to each other. Hence a single 
parameter can often be used to represent a feature common to several primitives, 
such as the common y position of layers belonging to a single row. These 
global parameters are known as hyperparameters, as the entities that they 
model are themselves parameters. The introduction of hyperparameters makes 
the model hierarchical, as illustrated in the directed acyclic graph (DAG) in Figure 
3. The hyperparameters defined in Table 1 are later used to represent our belief 
that primitives occur in rows, but there are many other possibilities. 



[Figure 3 diagram, top to bottom: hyperparameters (dx, y, a, b, d, ω, c, r); 
individual shape parameters of each layer; data (image pixel intensities)] 

Fig. 3. The hierarchical shape model. Hyperparameters model functions of the indivi- 
dual shape parameters. The camera projection matrices, and lighting conditions, could 
also be modelled as hyperparameters based on the data and shape parameters, but in 
this paper they are given as prior information. 






To sum up, architectural scenes containing a background layer C_0, together 
with a set of offset layers C_i, i = 1 ... m, are to be modelled. Each offset layer has 
an associated model M_j, j = 1 ... 4. The individual shape parameters and the 
hyperparameters together define the shape of the model, and can be represented 
as a parameter vector θ_S; next the texture parameters θ_T are defined. 

2.2 Texture Parameters 

The set of layers defined above defines a surface. Next this surface is discretized 
and a two dimensional coordinate system is defined on it. At each point X on this 
surface an unknown brightness parameter i(X) (between 0 and 255) is defined. 
These brightness parameters form the texture parameter vector θ_T. 

2.3 Evaluating the Likelihood 

The shape parameters (number of layers and their associated parameters) and 
the texture parameters give the total parameter vector 6. In order to estimate 
this its posterior probability must be maximized: 

p{6\Bl) = p{-D\91)p{9\l) (1) 

where I is the prior information (such as the camera matrices etc.) and D is the 
set of input images. This is a product of the likelihood and prior. To perform 
the optimization gradient descent is used. This would prove prohibitive if all 
Table 1. Example set of hyperparameters. Knowledge about overall scene structure can 
be imposed by assigning a probability distribution to each hyperparameter. 

Hyperparameter   Description 
dx               The spacing of x-axis position of the primitives 
y                The y-axis position of the primitives 
a                The horizontal scale of the primitives 
b                The vertical scale of the primitives 
d                The depth of the primitives 
ω                The orientation of the primitives 
c                The arch height of the primitives 
r                The bevel width of the primitives 

the parameters had to be searched simultaneously. Fortunately the task can be 
decomposed into several easier optimizations: first, the shape parameters of each 
layer can be optimized independently; second, only the shape parameters need 
to be optimized explicitly. It is now shown how to estimate the optimal set of 
texture parameters given these shape parameters. 

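Given the shape parameters and projection matrices, it is now assumed 
that the projected intensity of $\mathbf{X}$ is observed with noise, $i(\mathbf{X}) + \epsilon$, where $\epsilon$ 
has a Gaussian distribution with mean zero and standard deviation $\sigma_\epsilon$. The 
parameter $i(\mathbf{X})$ can then be found such that it minimizes the sum of squares 
$\min_{i(\mathbf{X})} \sum_j \left(i(\mathbf{x}^j) - i(\mathbf{X})\right)^2$, where $i(\mathbf{x}^j)$ is the intensity at $\mathbf{x}^j$, and $\mathbf{x}^j$ is the 
projection of $\mathbf{X}$ into the jth image. The likelihood for a given value of $i(\mathbf{X})$ is 

$$p(D|i(\mathbf{X})) = \prod_j \frac{1}{\sqrt{2\pi}\,\sigma_\epsilon} \exp\left(-\frac{\left(i(\mathbf{x}^j) - i(\mathbf{X})\right)^2}{2\sigma_\epsilon^2}\right) \qquad (2)$$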



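Using Equation (2) under the assumption that the errors $\epsilon$ in all the pixels 
are independent, the likelihood over all pixels can then be written: 

$$p(D|\theta_T \theta_S I) = \prod_i \prod_j \frac{1}{\sqrt{2\pi}\,\sigma_\epsilon} \exp\left(-\frac{\left(i(\mathbf{x}_i^j) - i(\mathbf{X}_i)\right)^2}{2\sigma_\epsilon^2}\right) \qquad (3)$$

where $\mathbf{x}_i^j$ is the projection of the ith scene point into the jth image. The 
product runs over all the discretized scene points $\mathbf{X}_i$ (lying on the surfaces 
of the layers). 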
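The two steps above (least-squares texture estimate, then Gaussian likelihood) can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: it assumes the projected intensities of each scene point have already been gathered into a list, and the function names and the noise level are hypothetical. It relies only on the standard fact that the value minimising a sum of squared differences is the mean.

```python
import math

def optimal_brightness(observed):
    """Least-squares brightness of one scene point: the mean of its projections."""
    return sum(observed) / len(observed)

def log_likelihood(points, sigma):
    """Log of the product in Eq. (3), summed over scene points and images.

    points: list of lists; points[i][j] is the intensity observed where
            scene point i projects into image j (hypothetical input layout).
    sigma:  assumed standard deviation of the Gaussian image noise.
    """
    total = 0.0
    for observed in points:
        i_X = optimal_brightness(observed)
        for i_x in observed:
            total += (-0.5 * math.log(2 * math.pi * sigma ** 2)
                      - (i_x - i_X) ** 2 / (2 * sigma ** 2))
    return total

# A surface hypothesis that brings the images into agreement scores higher.
consistent = [[100, 101, 99], [50, 50, 50]]
inconsistent = [[100, 140, 60], [50, 50, 50]]
print(log_likelihood(consistent, sigma=5.0) > log_likelihood(inconsistent, sigma=5.0))
```

Because the texture estimate is a closed-form mean, only the shape parameters need to be searched explicitly, which is the decomposition the text describes.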



2.4 Evaluating the Priors 

Prior knowledge of parameter values is encoded in the prior probability term of 
Equation (1), $p(\theta|I) = p(\theta_S \theta_T|I) = p(\theta_S|I)$, as the value of the texture 
parameters is determined by the shape parameters and the images. The shape parameter 
vector $\theta_S = (\alpha, \beta)$ contains both individual shape parameters $\alpha$ and 
hyperparameters $\beta$. Hence the prior probability $p(\theta_S|I) = p(\alpha\beta|I) = p(\alpha|\beta I)\,p(\beta|I)$. 
Table 2. Hyperpriors encoding a row of identical primitives. U[a, b] is the uniform 
distribution over the interval [a, b]. N(a, b) is the normal distribution with mean a 
and standard deviation b. A column of primitives is similarly constrained, by imposing 
a hyperprior on x and dy. The model typically has a spatial extent of [-0.5, 0.5] in 
the x and y axes of the world coordinate system; hence a standard deviation of 0.005 
corresponds to 2 or 3 pixels in a typical image of the scene. 




The term $p(\beta|I)$, known as a hyperprior, expresses a belief in the overall 
structure of the scene, while $p(\alpha|\beta I)$ determines how individual shapes in the scene 
are expected to vary within the overall structure. To express complete prior 
ignorance about the scene structure, each prior probability may be assigned a 
uniform distribution bounded by the range of the cameras’ fields of view. The 
correct distribution for each hyperparameter should ideally be learnt automatically 
from previous data sets; however, at present they are manually initialised. 
An example of a set of hyperpriors for a row of identical shapes is given in Table 2. 
Samples from this distribution are given in Figure 4. 
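The two-level prior can be illustrated by sampling from it: hyperparameters are drawn first, then each primitive is drawn around them. The sketch below is only indicative. The uniform ranges, the row origin `x0`, and the function name are hypothetical choices made for illustration; only the standard deviation of 0.005 and the [-0.5, 0.5] model extent are taken from the caption of Table 2.

```python
import random

def sample_row(n_primitives, rng, sigma=0.005):
    """Draw hyperparameters, then a row of near-identical primitives around them."""
    # Hyperprior level: assumed uniform ranges (illustrative only).
    hyper = {
        "x0": rng.uniform(-0.5, 0.0),   # hypothetical position of the first primitive
        "dx": rng.uniform(0.1, 0.3),    # spacing along the x axis
        "y": rng.uniform(-0.5, 0.5),    # shared y position
        "a": rng.uniform(0.02, 0.1),    # shared horizontal scale
        "b": rng.uniform(0.02, 0.1),    # shared vertical scale
        "d": rng.uniform(0.0, 0.1),     # shared depth
    }
    # Individual shape level: each parameter is normal about its hyperparameter.
    primitives = []
    for i in range(n_primitives):
        primitives.append({
            "x": rng.gauss(hyper["x0"] + i * hyper["dx"], sigma),
            "y": rng.gauss(hyper["y"], sigma),
            "a": rng.gauss(hyper["a"], sigma),
            "b": rng.gauss(hyper["b"], sigma),
            "d": rng.gauss(hyper["d"], sigma),
        })
    return hyper, primitives

rng = random.Random(0)
hyper, row = sample_row(3, rng)
for p in row:
    print({k: round(v, 3) for k, v in p.items()})
```

Replacing the narrow normal distributions with broad uniform ones would express the "no regularity" alternative against which the row model is later compared.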










Fig. 4. Samples drawn from the hyperprior distribution for a row of identical 
primitives given in Table 2, using two and three primitives. The intensity at each point is 
proportional to the depth offset from the background layer. 
3 Model Selection 

In Section 2 a set of parameters was defined, and the posterior probability 
(Equation (1)) was introduced as a means of estimating the optimal parameter values 
for a given model. However, a more fundamental problem remains: how to decide 
which model (i.e. which set of parameters) best represents a scene? This is the 
problem of model selection, described in this section. 

The goal of model selection is to choose the most probable of a finite set of 
models $M_j$, $j = 1 \ldots n$, given data $D$ and prior information $I$. Using Bayes' rule, 
the probability of each model can be expressed as 

$$p(M_j|DI) = \frac{p(D|M_j I)\,p(M_j|I)}{p(D|I)} \qquad (4)$$

The denominator $p(D|I)$ is constant for all models and hence is used only as a 
normalisation constant to ensure that $\sum_j p(M_j|DI) = 1$. The prior probability 
$p(M_j|I)$ can be used to encode any prior preference one has for each model. 
In the absence of any such prejudice this is uniform, and model selection depends 
primarily on the evidence 

$$p(D|M_j I) = \int p(D|\theta_j M_j I)\,p(\theta_j|M_j I)\,d\theta_j$$

where $\theta_j$ is the set of parameters belonging to model $M_j$. 

For this problem, the data $D$ is simply a set of images of the scene. The prior 
information $I$ is the projection matrix for each camera, and a noise model for 
projection into each image (Section 2.3). The parameter vector $\theta_j$ contains shape 
and texture parameters, as described in Section 2. Considering these separately, 
the evidence becomes 

$$p(D|M_j I) = \int\!\!\int p(D|\alpha_j M_j I)\,p(\alpha_j|\beta_j M_j I)\,p(\beta_j|M_j I)\,d\alpha_j\,d\beta_j$$

The $\beta_j$ term is dropped from the first factor of this equation, the likelihood of 
the data, as the probability of the data $D$ is dependent only on the individual 
shape parameters $\alpha_j$. The texture parameters are not considered here as they 
are completely determined by the shape parameters and the images. 

3.1 Evaluation of the Evidence 

It is impractical to perform the integration required by the evidence for any but the 
simplest models. However, previous work (e.g. [?]) has shown that an 
approximation to the evidence is sufficient for model selection. Five possible 
approximations are briefly described here. Consider first the inner integral, 

$$\int p(D|\alpha_j M_j I)\,p(\alpha_j|\beta_j M_j I)\,p(\beta_j|M_j I)\,d\alpha_j \qquad (8)$$
In any useful inference problem the likelihood function obtained from the data 
will be much more informative than the prior probabilities; if not, there is 
little to be gained by using the data. Hence the likelihood function has a sharp 
peak (relative to the prior distribution) around its maximum value, as shown in 
Figure 5(a). The entire integral is therefore well approximated by integrating over a 
neighbourhood of the maximum likelihood estimate of $\alpha_j$. For convenience the 
log likelihood, $\log L(\alpha_j)$, which is a monotonic function of the likelihood and 
hence is maximised at the same value, is considered rather than the likelihood 
itself. A second order approximation of $\log L(\alpha_j)$ about its mode $\alpha_{jML}$ is 

$$\log L(\alpha_j) \approx \log L(\alpha_{jML}) + \tfrac{1}{2}(\alpha_j - \alpha_{jML})^T H(\alpha_{jML})(\alpha_j - \alpha_{jML}) \qquad (9)$$

where $H(\alpha_{jML})$ is the Hessian of $\log L$ evaluated at its mode. This corresponds 
to approximating the likelihood in this region by a multivariate Gaussian with 
covariance matrix $\Sigma_\alpha = -H^{-1}(\alpha_{jML})$: 

$$L(\alpha_j) \approx L(\alpha_{jML}) \exp\left(-\tfrac{1}{2}(\alpha_j - \alpha_{jML})^T \Sigma_\alpha^{-1} (\alpha_j - \alpha_{jML})\right) \qquad (10)$$



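Assuming that $p(\alpha_j|\beta_j M_j I) = p(\alpha_{jML}|\beta_j M_j I)$ in this region, the integrand of 
Equation (8) reduces to a constant multiple of this Gaussian, whose integral is 

$$\int \exp\left(-\tfrac{1}{2}(\alpha_j - \alpha_{jML})^T \Sigma_\alpha^{-1} (\alpha_j - \alpha_{jML})\right) d\alpha_j = (2\pi)^{k_\alpha/2}\sqrt{\det \Sigma_\alpha} \qquad (11)$$

where $k_\alpha$ is the number of shape parameters, as a Gaussian must integrate to 1. 
Hence the integral of Equation (8) is approximately 

$$L(\alpha_{jML})\,(2\pi)^{k_\alpha/2}\sqrt{\det \Sigma_\alpha}\;p(\alpha_{jML}|\beta_j M_j I)\,p(\beta_j|M_j I) \qquad (12)$$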



and the evidence is approximated as 

$$p(D|M_j I) \approx L(\alpha_{jML})\,(2\pi)^{k_\alpha/2}\sqrt{\det \Sigma_\alpha} \int p(\alpha_{jML}|\beta_j M_j I)\,p(\beta_j|M_j I)\,d\beta_j \qquad (13)$$

The remaining integral can be similarly approximated; now the parameter values 
$\alpha_{jML}$ are the “data” being used to estimate the hyperparameters $\beta_j$. The final 
expression for the evidence is therefore 

$$p(D|M_j I) \approx L(\alpha_{jML})\,p(\alpha_{jML}|\beta_{jML} M_j I)\,p(\beta_{jML}|M_j I)\,(2\pi)^{k/2}\sqrt{\det \Sigma_\alpha \det \Sigma_\beta} \qquad (14)$$

where $k$ is the total number of parameters, $\Sigma_\beta$ is the covariance matrix of the 
hyperparameters with respect to the shape parameters and $\beta_{jML}$ is the set of 
hyperparameters which maximise $p(\alpha_{jML}|\beta_j M_j I)$. The terms to the right of 
the likelihood approximate the fraction of the volume of prior probability space 
enclosed by the maximum likelihood peak, and are known collectively as an 
Occam factor [?]. 
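A one-parameter toy problem makes the approximation concrete. The sketch below is not the paper's model: it assumes a single unknown mean observed with Gaussian noise of known standard deviation and a uniform prior, with made-up data values. It compares the peak likelihood multiplied by the Occam factor against brute-force numerical integration of the evidence.

```python
import math

def log_likelihood(a, data, sigma):
    """Gaussian log likelihood of the data for a single unknown mean a."""
    return sum(-0.5 * math.log(2 * math.pi * sigma ** 2)
               - (d - a) ** 2 / (2 * sigma ** 2) for d in data)

def laplace_evidence(data, sigma, prior_lo, prior_hi):
    """Peak likelihood times Occam factor, for a uniform prior on [prior_lo, prior_hi]."""
    n = len(data)
    a_ml = sum(data) / n                  # maximum likelihood estimate
    variance = sigma ** 2 / n             # minus the inverse Hessian of log L at a_ml
    prior_density = 1.0 / (prior_hi - prior_lo)
    occam = prior_density * math.sqrt(2 * math.pi * variance)
    return math.exp(log_likelihood(a_ml, data, sigma)) * occam, occam

def numeric_evidence(data, sigma, prior_lo, prior_hi, steps=20000):
    """Midpoint-rule integration of likelihood times prior, for comparison."""
    width = (prior_hi - prior_lo) / steps
    prior_density = 1.0 / (prior_hi - prior_lo)
    total = 0.0
    for s in range(steps):
        a = prior_lo + (s + 0.5) * width
        total += math.exp(log_likelihood(a, data, sigma)) * prior_density * width
    return total

data = [0.9, 1.1, 1.0, 1.05, 0.95]        # hypothetical observations
approx, occam = laplace_evidence(data, sigma=0.1, prior_lo=-5.0, prior_hi=5.0)
exact = numeric_evidence(data, sigma=0.1, prior_lo=-5.0, prior_hi=5.0)
print(approx, exact, occam)
```

In this toy case the log likelihood is exactly quadratic, so the two numbers agree closely; for the shape models in the text the quadratic form is only a local approximation around the peak.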
3.2 Occam Factors 

An Occam factor encodes the idea of Occam’s Razor for model selection: a simp- 
ler model (usually one with fewer parameters) should be preferred to a more 
complex one, unless the more complex one explains or fits the data significantly 
better. Information theoretic techniques such as Minimum Description Length 
(MDL) encoding [?] enforce this preference by penalising models according to the 
information required to encode them. The Bayesian approach to model selection 
naturally incorporates an identical penalty in the evaluation of the evidence as 
the product of a likelihood and an Occam factor. As extra parameters are intro- 




Fig. 5. Occam factor for (a) one and (b) two independent parameters with uniform 
independent priors. For a single variable, the likelihood is approximated as $p(D|a_{ML}MI)$ 
and the Occam factor is $w/p(a|MI)$. In the case of two variables the likelihood is 
$p(D|a_{ML}b_{ML}MI)$ and the Occam factor is $w_a w_b / p(ab|MI)$. In each case the Occam 
factor approximates the fraction of the volume of prior probability (the shaded volume) 
occupied by the maximum likelihood probability peak. 



duced into the model, the fraction of the volume of parameter space occupied by 
the peak surrounding the maximum likelihood estimate inevitably decreases, as 
illustrated in Figure 5 for the simple case of 1 and 2 parameter models. Because 
the volume of prior probability over the entire parameter space must always be 
1, this decrease in occupied volume translates to a decrease in prior probability, 
assuming that prior probability is reasonably uniform over the parameter space. 



3.3 Other Approximations to the Evidence 

The Occam factor is one of many possible model selection criteria. By 
disregarding it completely, evidence evaluation reduces to a maximum likelihood 
(ML) estimation. This includes no preference for model parsimony and hence 
will always select the best fitting model regardless of its complexity. If the prior 
terms $p(\alpha_{jML}|\beta_{jML} M_j I)$ and $p(\beta_{jML}|M_j I)$ are known to be almost uniform, 
Schwarz [?] suggests approximating them with diffuse normal distributions, and 
approximating $\det \Sigma_\alpha \det \Sigma_\beta$ by $N^{-k}$, where $N$ is the number of 
observations and $k$ is the number of parameters in the model. This forms the Bayesian 
Information Criterion (BIC) measure of evidence 

$$\log p(D|M_j I) \approx \log L(\alpha_{jML}) - \frac{k}{2}\log N. \qquad (15)$$

A non-Bayesian penalty term, the AIC, has the form 

$$\log p(D|M_j I) \approx \log L(\alpha_{jML}) - 2k \qquad (16)$$

and hence penalises models according to the number of parameters they include. 
Finally, if the posterior distribution is very peaked, the MAP estimate of each 
model may be the same order of magnitude as the evidence, in which case one 
would expect it to perform just as well for model selection. Each of these criteria 
is compared in Section 5. 
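The penalised criteria are simple enough to compare directly in code. The sketch below applies the ML, BIC and AIC scores in the forms of Equations (15) and (16) to two candidate models. The log likelihoods, the parameter count of the second model, and the number of observations are hypothetical values chosen for illustration; they are not results from the paper.

```python
import math

def selection_scores(log_l, k, n_obs):
    """Approximate log evidence under each criterion (higher is better)."""
    return {
        "ML": log_l,                               # no complexity penalty
        "BIC": log_l - 0.5 * k * math.log(n_obs),  # Eq. (15)
        "AIC": log_l - 2.0 * k,                    # penalty in the form of Eq. (16)
    }

def select(models, n_obs):
    """models: dict mapping a model name to (log likelihood, parameter count)."""
    chosen = {}
    for measure in ("ML", "BIC", "AIC"):
        chosen[measure] = max(
            models,
            key=lambda name: selection_scores(*models[name], n_obs)[measure],
        )
    return chosen

# Hypothetical case: the complex model fits only marginally better.
models = {"rectangle": (-100.0, 6), "bevelled_arch": (-99.5, 8)}
print(select(models, n_obs=1000))
```

With a marginal gain in fit, ML alone picks the more complex model while both penalised scores fall back to the simpler one, which is the behaviour the text attributes to these criteria.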

4 Implementation Issues 

4.1 Initialisation 

The purpose of the initialisation stage is to provide a rough estimate of the 
number of planes to be modelled, and their position, scale, depth and orientation. 
First, each image is warped by a transformation $A_j$ so that the layer $\mathcal{L}_0$ is 
aligned. Approximate projection matrices are found by estimating each camera 
pose from $A_j$, as in [?]. A dense parallax field is obtained by applying a wavelet 
transform to each warped image, and performing multiresolution matching in the 
phase domain [?]. The correspondences obtained from each pair of images are 
fused robustly to obtain depth estimates for each point, from which initial layer 
estimates can be hypothesised. 

Initial parameter estimates are obtained by fitting the simplest model, a 6 
parameter rectangle, to each region (see Figure 6). The centre of the rectangle 
is positioned at the centroid of the region. The horizontal and vertical scales 
are set to the average distance of each of the extrema of the region from the 
centroid, the depth is given by the depth of the centroid and the orientation is 
assumed to be vertical (i.e. $\omega = 0$). The projection matrices generated by this 
system have a typical reprojection error of order 1 pixel. 
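The rectangle initialisation described above can be sketched as follows. The input layout (a list of `(x, y, depth)` samples for one hypothesised region) and the function name are assumptions made for illustration, and the "depth of the centroid" is approximated here by the depth of the sample nearest to the centroid.

```python
def initial_rectangle(region):
    """Fit the 6-parameter rectangle (x, y, a, b, d, omega) to one region.

    region: list of (x, y, depth) samples belonging to a hypothesised layer.
    """
    n = len(region)
    xs = [p[0] for p in region]
    ys = [p[1] for p in region]
    cx = sum(xs) / n                      # centroid gives the rectangle centre
    cy = sum(ys) / n
    # Scale: average distance of the two extrema from the centroid, per axis.
    a = ((max(xs) - cx) + (cx - min(xs))) / 2.0
    b = ((max(ys) - cy) + (cy - min(ys))) / 2.0
    # Depth at the centroid, taken from the nearest available sample (assumption).
    nearest = min(region, key=lambda p: (p[0] - cx) ** 2 + (p[1] - cy) ** 2)
    return {"x": cx, "y": cy, "a": a, "b": b, "d": nearest[2], "omega": 0.0}

# Hypothetical region: a 5 x 3 grid of samples at constant depth 0.5.
region = [(float(x), float(y), 0.5) for x in range(5) for y in range(3)]
print(initial_rectangle(region))
```

The result is only a seed for the subsequent gradient descent, so a coarse fit of this kind is sufficient.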



4.2 Search for the Maximum Likelihood Parameters 

A multiresolution gradient descent search is used to locate the maximum 
likelihood parameters $\alpha_{jML}$ for each possible shape model. The image is recursively 
convolved with a Gaussian filter and downsampled by a factor of 2 horizontally 
and vertically to obtain a multiresolution representation. The search is initialised 
Fig. 6. (a) A poor initial reconstruction (compare with Figure 9) based only on stereo 
between two images, and (b) initial layer estimates based on three images. The layers 
are bounded by the black lines and meet the zero layer at the white lines. Each major 
offset layer is detected, but their shape and size are estimated poorly. 



at the coarsest level, and the estimate found is used to seed the search at a finer 
level. At each level the model is sampled more densely to maintain a constant 
sample rate of approximately one point per image pixel. Experience shows that 
two or three levels of resolution are sufficient, and the search typically converges 
in less than 100 iterations. 
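The coarse-to-fine scheme can be sketched independently of the shape model. In the illustration below the Gaussian filter is approximated by a separable [1, 2, 1]/4 binomial kernel, and the per-level optimiser is passed in as a callback; both are assumptions, since the text does not specify the filter width or the descent step.

```python
def blur_and_downsample(image):
    """Smooth with a separable [1, 2, 1]/4 kernel, then keep every second pixel."""
    h, w = len(image), len(image[0])

    def clamp(v, hi):
        return max(0, min(v, hi))         # replicate border pixels

    horiz = [[(row[clamp(x - 1, w - 1)] + 2 * row[x] + row[clamp(x + 1, w - 1)]) / 4.0
              for x in range(w)] for row in image]
    vert = [[(horiz[clamp(y - 1, h - 1)][x] + 2 * horiz[y][x]
              + horiz[clamp(y + 1, h - 1)][x]) / 4.0
             for x in range(w)] for y in range(h)]
    return [row[::2] for row in vert[::2]]

def build_pyramid(image, levels):
    """Return [finest, ..., coarsest]; each level halves both dimensions."""
    pyramid = [image]
    for _ in range(levels - 1):
        pyramid.append(blur_and_downsample(pyramid[-1]))
    return pyramid

def coarse_to_fine(pyramid, refine, initial):
    """Run refine(image, level, estimate) from the coarsest level to the finest."""
    estimate = initial
    for level in reversed(range(len(pyramid))):
        estimate = refine(pyramid[level], level, estimate)
    return estimate

image = [[10.0] * 8 for _ in range(8)]
pyramid = build_pyramid(image, levels=3)
print([(len(p), len(p[0])) for p in pyramid])
```

Here `refine` stands in for the gradient descent at one resolution: each call starts from the estimate found at the coarser level, which is the seeding strategy described in the text.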

5 Results 

5.1 Model Selection for the Shape Parameters 

Initially the model selection algorithm is assessed by trying to identify the correct 
shape for a single layer. Starting from the parameters found during initialisation 
(Section 4.1), the gradient descent method described in Section 4.2 is used to find 
the model $M_1$ maximum likelihood shape parameters $\alpha_{ML1}$ for that layer. This 
parameter set is then used to initialise the search for the set $\alpha_{ML}$ for each of 
the models $M_2$, $M_3$ and $M_4$ in turn. Model selection is then performed using 5 
measures: Occam Factors (OF), maximum likelihood (ML), Bayesian Information 
Criterion (BIC), Akaike Information Criterion (AIC) and MAP likelihood 
evaluation. Results are given in Figure 7. 

Model $M_1$: Rectangle 

Because the layer of the door is well represented by the rectangle model with 6 
parameters, the maximum likelihood parameters for more complex models are 
the same, and the ML measure is not altered for different models. Each other 
measure selects $M_1$ because it includes the fewest parameters. 

Model $M_2$: Arch 

Models $M_1$ and $M_3$, which do not contain arches, are clearly inadequate for 
this layer. The maximum likelihood of models $M_2$ and $M_4$ is very similar, so 
again the maximum likelihood measure is ambiguous while the other measures 
all select the simpler model $M_2$. 
Model $M_3$: Bevelled Rectangle 

Models $M_1$ and $M_2$, which do not incorporate bevelling, fit the indentation 
poorly at its sloped edges. Models $M_3$ and $M_4$ have similar likelihoods, so $M_3$ 
is chosen by all measures except ML. 

Model $M_4$: Bevelled Arch 

In this case only the most complex model adequately describes the data. It is 
chosen by all measures despite its complexity, as the likelihood of the data for 
this model is significantly higher than for other models. 

For each model, each model selection measure is clearly dominated by the 
likelihood term. The OF, BIC, AIC and MAP measures all give similar results 
and appear adequate for preventing model overfitting. However, the Occam 
factor is more theoretically sound than the other measures and incurs little extra 
computational expense, and is therefore preferred. 

5.2 Model Selection for the Hyperparameters 

Having selected a shape model for each layer in the scene, it is possible to discern 
not only between individual shapes, but also between configurations of shapes. 
As a simple test case, the evidence for a set of layers having no geometric 
alignment ($p(\beta)$ uniform, and $p(\alpha|\beta)$ uniform) is compared with the evidence for their 
belonging to a row of primitives, with priors given in Table 2. In this section, 
evidence is measured only using the Occam factor. 

Gateway scene: 

Figure 8(a) gives the layer models selected for each layer in the scene. In 
Figure 8(b) the evidence for each combination of two or more layers belonging to a 
row of identical primitives (black bars) is compared to the evidence for their being 
a uniformly distributed collection of shapes (white bars). No prior preference is 
expressed for either of these models. Clearly any combination of layers which 
includes the gateway is more likely to be part of a random scene, as the gateway 
is quite dissimilar in size and shape to the indentations. However the evidence 
for the two indentations taken by themselves belonging to a row is much higher 
than for their belonging to a general structure. Having detected this regularity, 
the indentations can be represented using 8 parameters (7 for one indentation, 
and the x position of the other) rather than 14. If such regularity can be detected 
in several collections of shapes, it can in turn be used to form hypotheses about 
higher level structure, such as the architectural style of the building as a whole. 
Gothic church scene: 

Figure 8(d) gives the evidence for several combinations of layers from the 
segmentation in Figure 8(c) belonging to a row as opposed to an arbitrary structure. 
There is a clear preference for a model with no regularity when the layers chosen 
include both windows and columns. However, when only the windows are tested, 
the row model is clearly preferred, despite the window parameters being slightly 
different due to some fitting errors caused by ambiguity near the boundary of 
each layer. Similarly the row model was preferred for the three columns, again 
allowing both a compact representation of the scene and the possibility of higher 
level inference about the scene structure. 
Model M1: Rectangle 

Measure (x10^?)   M1       M2       M3       M4 
OF                1.4781   1.4787   1.4785   1.4797 
ML                1.4731   1.4731   1.4731   1.4731 
BIC               1.4764   1.4775   1.4775   1.4786 
AIC               1.4743   1.4745   1.4745   1.4747 
MAP               1.4750   1.4753   1.4753   1.4757 




Model M2: Arch 

Measure (x10^?)   M1       M2       M3       M4 
OF                2.5870   2.5678   2.5872   2.5691 
ML                2.5817   2.5618   2.5807   2.5618 
BIC               2.5850   2.5657   2.5842   2.5662 
AIC               2.5829   2.5632   2.5821   2.5634 
MAP               2.5837 

Model M3: Bevelled Rectangle 
Model M4: Bevelled Arch 

Measure (x10^?)   M1       M2       M3       M4 
OF                2.3682   2.3623   2.3588   2.3512 
ML                2.3628   2.3561   2.3528   2.3444 
BIC               2.3661   2.3599   2.3566   2.3488 
AIC               2.3640   2.3575   2.3542   2.3460 
MAP               2.3649   2.3586   2.3552   2.3472 




Fig. 7. Evidence evaluation for single shapes. From left to right in each row: negative 
log evidence for this shape being an instance of each shape model, worst fit shape, best fit 
shape. Occam factor (OF), Maximum likelihood (ML), Bayesian Information Criterion 
(BIC), Akaike Information Criterion (AIC) and MAP probability measures are given. 
The model selected by each measure is in bold face. 
Fig. 8. Testing for rows of similar shapes. The black bar is the (negative log) evidence 
for the shapes belonging to the row model; the white bar is the evidence for shapes 
having no regularity (see Section 5.2). Evidence has been normalised by subtracting out 
common factors. 




Fig. 9. Recovered 3D surface of the Caius gateway scene. 
5.3 Selecting the Number of Layers 

Comparison of the evidence can determine the number of layers present in a 
scene as well as their shape. Figure 10 gives the evidence for the gateway scene 
being modelled by 3, 2 and 1 primitives, which is clearly maximised for the 3 
primitive case. The subsequent addition of a spurious primitive such as a depth 
0 rectangle decreases the Occam factor while the likelihood remains constant, 
and hence is not selected. 




Fig. 10. Negative log evidence for different numbers of layers in the gateway scene. 
From left to right: evidence for 3 detected layers, evidence for the gateway and only one 
indentation, evidence for the gateway only, evidence for all layers including spurious 
rectangle (shown above). The 3 and 4 layer models are clearly preferred to those with 
1 and 2 layers; the 3 layer model is selected as it has a higher Occam factor. 



6 Conclusion 

This paper presents a novel approach to layer extraction with the aim of crea- 
ting a 3D model of the images that accurately reflects prior belief. This has been 
effected by a Bayesian approach with explicit, rather than implicit modelling 
of the distribution over segmentations. Given a hypothesised segmentation it is 
shown how to evaluate its likelihood and how to compare it with other 
hypotheses. A variety of model selection measures are considered, all but the most 
basic of which prove adequate to prevent model overfitting for the architectural 
scenes on which this approach is demonstrated. The Occam factor is 
recommended as it is more accurate and theoretically sound while incurring minimal extra 
computational cost. 

The hierarchical nature of the shape model means that it is easily extended 
to more complex scenes than those presented here. Future work will extend the 
number and type of shape primitives modelled, and the number of levels in 
the hierarchical shape model. For example, it should be possible to infer both 
the minimal parametrisation and the architectural style (e.g. Gothic, Georgian, 
modern bungalow) of the scene. A fully automatic initialisation scheme will use 
these hierarchical models to constrain an initial search for primitives in each 
image. 
Abstract. The initialisation of segmentation methods aiming at the lo- 
calisation of biological structures in medical imagery is frequently regar- 
ded as a given precondition. In practice, however, initialisation is usually 
performed manually or by some heuristic preprocessing steps. Moreover, 
the same framework is often employed to recover from imperfect results 
of the subsequent segmentation. Therefore, it is of crucial importance for 
everyday application to have a simple and effective initialisation method 
at one’s disposal. This paper proposes a new model-based framework to 
synthesise sound initialisations by calculating the most probable shape 
given a minimal set of statistical landmarks and the applied shape model. 
Shape information coded by particular points is first iteratively removed 
from a statistical shape description that is based on the principal com- 
ponent analysis of a collection of shape instances. By using the inverse 
of the resulting operation, it is subsequently possible to construct in- 
itial outlines with minimal effort. The whole framework is demonstrated 
by means of a shape database consisting of a set of corpus callosum 
instances. Furthermore, both manual and fully automatic initialisation 
with the proposed approach is evaluated. The obtained results validate 
its suitability as a preprocessing step for semi-automatic as well as fully 
automatic segmentation. And last but not least, the iterative construc- 
tion of increasingly point-invariant shape statistics provides a deeper 
insight into the nature of the shape under investigation. 



1 Introduction 

The advent of the “Active Vision” paradigm in the 1980s came along with the 
idea of using model-based prior knowledge to simplify and stabilise the treat- 
ment of a specific vision problem. Since then, all kinds of active shape models 
have emerged in many application areas in various forms such as Snakes [?], 
deformable templates or active appearance models [2]. The amount of prior 
knowledge included in these models varies from simple general smoothness 
assumptions to very detailed knowledge about the shape and the image data to be 
expected. In the field of medical imaging, statistical shape models 
have found widespread use, since the notion of biological shape seems 
to be best defined by a statistical description of a large population. 

Even though these statistical methods have proven to be fairly stable and 
reliable, there are cases where they fail completely in finding at least an approxi- 
mation of the correct object boundary. If a certain application asks for absolutely 
flawless segmentations, alternative or supplemental frameworks must be applied 
to compensate for the missing functionality. On the one hand, we could employ 
semi-automatic [?] or manual segmentation tools that rely on a human 
operator providing the missing information. On the other hand, we may initialise 
the fully automatic procedure such that the correct solution lies close to the 
initial configuration. Since almost all semi-automatic methods rely on suitable 
initialisations as well, the provision of a reasonable starting point seems to be a 
valuable extension of both approaches. Our main goal is therefore to provide a 
possibly interactive initialisation method that still takes into account the prior 
knowledge of the shape as far as possible. 

In order to keep the amount of required user input as small as possible, simple 
and intuitive interaction metaphors are of crucial importance for the design 
of such a tool. Since the most simple and probably most feasible interaction 
metaphor is still the adjustment of individual points lying on the boundary of 
the object under investigation, we are subsequently looking for a small number 
of points describing the overall shape of the object to be segmented — analogous 
to coarse control polygons of hierarchical shape descriptions that have recently 
been proposed in the field of modelling and animation [?]. Such a “coarse 
control polygon” should capture as much prior shape knowledge as possible. 
And there should be a way to calculate the most “natural” fine scale shape 
given the correct arrangement of the control vertices. 

Our shape database should therefore be able to answer the following three 
questions: Which points along the object boundary are best suited for a compact 
and robust description of the shape? How many control vertices must be included 
in the coarsest control polygon? And how should the full resolution object be 
predicted so as to provide a reasonable initial outline? 

In search of answers to these questions, we have decided to pursue the fol- 
lowing strategy: Using statistical shape analysis, we examine the remaining va- 
riability of shape, if the variation coded by the position of individual points 
is progressively subtracted. The coarsest control polygon necessary to capture 
the main shape characteristics is complete as soon as the remaining variability is 
small with respect to the working range of the subsequent segmentation method. 
And the most probable shape for a given control polygon can then be calculated 
by just inverting the process of subtracting the variation of control vertices. 



2 Experimental Set-Up 

In order to have a compact statistical shape description at our disposal, we em- 
ploy a representation that is based on a principal component analysis (PCA) 
of all object instances in our database. This approach, first proposed by Cootes 
and Taylor in [?], has the very useful property of reflecting the shape variations 
occurring within the population by a complete set of basis vectors. These basis 
vectors span a linear shape space containing all the instances of our collection. 
This enables us to apply the whole framework of linear algebra to make the 
statistic point-wise invariant. Furthermore, the properties of such a shape description 
are well understood and appropriately documented [?]. 

In addition to a statistical description method, we need a population of 
several object instances representing our model-based foreknowledge. The PCA 
is therefore applied to a collection of 71 hand segmented outlines of the 
corpus callosum on mid-sagittal MR-slices. Five randomly selected examples of this 
database are illustrated in Fig. 1. All aspects of the model building process 
regarding this population are described in detail in [?]. 

Since we aim at working with a vertex-based control polygon at interactive 
speed, the original representation based on elliptic Fourier descriptors [?] 
has been converted to a polygonal representation by equidistantly sampling the 
parameter space of the outline. For the following analysis, we assume that the 
underlying arc-length based curve parameterisation with normalised parameter 
starting point provides a sufficiently good correspondence between the individual 
specimens. All experiments we performed suggest that the achieved 
correspondence is not faultless but sufficiently precise for our intentions (see also [?]). 
In order to normalise the model contours, we represented the vertex positions 
as usual with respect to an anatomical coordinate system given by the AC-PC 
line. Experience shows that these anatomical landmarks can easily be located 
and are very stable with respect to the corpus callosum. 




Fig. 1. Five randomly selected corpora callosa from our collection that consists of 71 
examples. 

In the following, Section 3 briefly reviews the statistical shape analysis using 
principal components and fixes the mathematical notation. In Section 4 we 
discuss in detail the aforementioned progressive subtraction of variation, and 
Section 5 subsequently describes the inversion of this operation. The initialisation 
procedure based on the presented framework is evaluated in Section 6 for both 
interactive and fully automatic modes of operation. Finally, Section 7 concludes 
this report and outlines the next steps towards a highly robust initialisation 
oracle. 
3 Shape Analysis Using Principal Components 

The basic idea of statistical shape analysis using principal components consists 
in separating and quantifying the main variations of shape that occur within a 
population of several instances exemplifying the same object. More precisely, a 
PCA defines a linear transformation that decorrelates the parameter signals of 
the original shape population by projecting the objects into a linear shape space 
spanned by a complete set of orthogonal basis vectors. If the parameter signals 
are highly correlated, then the coarse scale variations of shape are described by 
the first few basis vectors, whereas fine details are captured by the remaining 
ones. Furthermore, if the joint distribution of the parameters describing the 
shape is Gaussian, then a reasonably weighted linear combination of the basis 
vectors results in a shape that is similar to the existing ones. On the other 
hand, if the joint distribution of the parameters is highly non-Gaussian or if the 
dependencies of the parameter signals are non-linear, then other decomposition 
methods such as the independent component analysis [3 should be employed. 

As already mentioned, the considered population consists of $N + 1 = 71$ 
corpus callosum instances, given as polygonal models $\mathbf{p}_i = [x_i^1, y_i^1, \ldots, x_i^M, y_i^M]^T$ 
with $M = 256$ points. Since we will later compare statistic-based initialisations 
to the ground truth given by one object instance, we always exclude this 
particular instance from the statistic for cross-validation. To simplify the formalism, we 
centre the parameter signals of the shapes beforehand by calculating an average 
model $\bar{\mathbf{p}}$ and an instance specific difference vector $\Delta\mathbf{p}_i$: 

$$\bar{\mathbf{p}} = \frac{1}{N}\sum_{i=1}^{N} \mathbf{p}_i, \qquad \Delta\mathbf{p}_i = \mathbf{p}_i - \bar{\mathbf{p}}, \qquad \Delta P = [\Delta\mathbf{p}_1 \cdots \Delta\mathbf{p}_N] \qquad (1)$$

Note that the $N = 70$ difference vectors span only a 69-dimensional space; the 
missing dimension originates from the linear dependence $\sum_{i=1}^{N} \Delta p_i = 0$. 
The corresponding covariance matrix $\Sigma \in \mathbb{R}^{2M \times 2M}$ is consequently 
rank-deficient. As has been pointed out in [?], this circumstance can be exploited to 
speed up the calculation of the 69 valid eigenvalues and eigenvectors: instead of 
calculating the full eigensystem of the covariance matrix $\Sigma$, the multiplication 
of the eigenvectors of a smaller matrix $\Sigma'$ with $\Delta P$ leads to the correct 
principal components: 

$$\Sigma' = \frac{1}{N-1}\,\Delta P^T \Delta P = U' \Lambda' U'^T, \qquad \Lambda' = \mathrm{diag}(\lambda_1, \ldots, \lambda_{N-1}, 0),$$

$$U = [u_1 \cdots u_{N-1}\; u_N] = \psi(\Delta P\, U'), \qquad \psi(A) = \text{normalise columns of } A \tag{2}$$

As an alternative that is not equally fast but conceptually more elegant, we 
propose to work in a subspace with a complete set of basis vectors to find the 
eigensystem of our data. To do so, we project the difference vectors $\Delta p_i$ into 
a lower-dimensional space whose basis $M$ is constructed by the Gram-Schmidt 
orthonormalisation $\chi$: 

$$M = [m_1 \cdots m_{N-1}] = \chi(\Delta p_1, \ldots, \Delta p_{N-1}), \qquad \Delta\tilde{p}_i = M^T \Delta p_i \tag{3}$$
Note that one arbitrary $\Delta p_i$ must be dropped for the construction of $M$, and 
$\Delta\tilde{p}_i$ denotes the projection of $\Delta p_i$ into the subspace spanned by $M$. 
The covariance matrix $\tilde{\Sigma}$ and the resulting PCA given by the eigensystem of 
$\tilde{\Sigma}$ can subsequently be calculated according to: 

$$\tilde{\Sigma} = \frac{1}{N-1}\sum_{i=1}^{N} \Delta\tilde{p}_i\, \Delta\tilde{p}_i^T = \tilde{U} \Lambda \tilde{U}^T \tag{4}$$

The principal components defining the eigenmodes in shape space are then given 
by back-projecting the eigenvectors $\tilde{U}$: $U = [u_1 \cdots u_{N-1}] = M\tilde{U}$. Each object 
instance can be represented as a linear combination $p_i = \bar{p} + U b_i$ of these 
eigenmodes, where $b_i = [b_1^i, \ldots, b_{N-1}^i]^T$. In order to calculate the uncorrelated 
coordinates of each object instance, we project the difference vectors $\Delta p_i$ into 
the eigenspace: $b_i = U^T \Delta p_i$. 

The first four eigenmodes resulting from the PCA of our population are 
displayed in Fig. 3(a). The shapes representing the first eigenmode on the left 
are calculated by adding the weighted first eigenvector $u_1$ to the average model 
$\bar{p}$. The following three shape variations to the right of the first one are calculated 
correspondingly. 
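
The small-matrix route of Eq. (2) can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `shape_pca` and the layout of `P` (one shape per column, coordinates interleaved as x_1, y_1, ..., x_M, y_M) are chosen here for illustration only.

```python
import numpy as np

def shape_pca(P):
    """PCA of shape vectors via the small N x N matrix (cf. Eq. 2).

    P is a hypothetical (2M, N) array holding one shape per column.
    Returns the mean shape, the eigenmodes U (2M x N-1), the eigenvalues
    (descending) and the coordinates B (N-1 x N) of every instance.
    """
    N = P.shape[1]
    p_mean = P.mean(axis=1)
    dP = P - p_mean[:, None]                  # centred difference vectors
    small = dP.T @ dP / (N - 1)               # N x N instead of 2M x 2M
    lam, U_small = np.linalg.eigh(small)      # eigenvalues in ascending order
    keep = np.argsort(lam)[::-1][: N - 1]     # drop the zero eigenvalue
    lam, U_small = lam[keep], U_small[:, keep]
    U = dP @ U_small                          # map back to shape space
    U /= np.linalg.norm(U, axis=0)            # normalise the columns
    B = U.T @ dP                              # b_i = U^T dp_i
    return p_mean, U, lam, B
```

The returned coordinates are uncorrelated by construction, so their sample covariance is the diagonal matrix of the eigenvalues.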



4 Progressive Elimination of Variation 

Given a statistical analysis as defined above, we consider the following situation: 
After having defined the shape coordinate system by locating the AC-PC line, 
the initialisation of a new object instance starts with the average model $\bar{p}$, as 
illustrated in Fig. 2(a) on the left. Let us assume for the moment that the 
aforementioned coarse control polygon consists of the three marked vertices on 
the outline of the mean shape. To generate an initial approximation of the object, 
we now define a set of boundary conditions for the global shape by moving the 
control vertices to an approximately correct position. Given these constraints and 
our prior knowledge of the shape, we wish to choose that outline for initialisation 
which is most natural in that case. In the following two sections we will show 
how this most probable outline can be found. 

Since we hope that some control vertices carry more shape information than 
others, we approach the whole problem iteratively. In a first step, we calculate 
the most probable shape that satisfies only the boundary conditions provided by 
the most important control vertex. For the second most important control point, 
we subsequently use the resulting outline as initial configuration. This process 
is then repeated until we can satisfy all the boundary conditions. Since we do 
not yet know how to determine the most important control vertex, we will first 
investigate the computation of the most probable shape given the position of an 
arbitrary point. This will be the subject of the next subsection. The problem of 
finding the points carrying most shape information will be discussed afterwards 
in subsection 4.3. 

Fig. 2. (a) Boundary conditions for an initial outline are established by prescribing 
a position for each coarse control vertex; the panel shows the initial average model 
(left) and the correct segmentation (right). (b) Shape variations caused by adding the 
two basis vectors $R_j$ to the average model, inducing x- and y-translations of point 
$j$, respectively. The various shapes are obtained by evaluating $\bar{p} + \omega\, U r_k$ with 
$\omega \in \{-2, \ldots, 2\}$ and $k \in \{x_j, y_j\}$. 

4.1 Shape-Based Basis Vectors for One Point 

To start with, we must translate our conceptual goal into mathematical terms. 
Since the most probable shape is given by the mean model $\bar{p}$ in the context of 
PCA, we can reinterpret the notion of “choosing the most probable outline” as 
“choosing the shape with minimal deviation from the mean”. This means 
nothing else than choosing the model with minimal Mahalanobis-distance $D_m$, 
the common metric in eigenspaces. 

The key idea enabling the solution of our first problem can now be summarised 
as follows: We must find two vectors in the space of variation that describe 
decoupled x- and y-translations of a given point $j$ with minimal variation, 
respectively. In other words, these two vectors should cause a unit translation of vertex 
$j$ in either x- or y-direction, and they should have minimal Mahalanobis-length 
$D_m$. If we have found them, we can satisfy all possible boundary conditions caused 
by one vertex with minimal variation by just adding the two appropriately 
weighted “basis” vectors to the mean. This problem gives rise to the following 
constrained optimisation: 

Let $r_{x_j}$ and $r_{y_j}$ denote the two unknown basis vectors causing unit x- and 
y-translation of point $j$, respectively. The Mahalanobis-length $D_m$ of these two 
vectors is then given by: 

$$D_m(r_k) = \sum_{e=1}^{N-1} \frac{r_{k,e}^2}{\lambda_e} = (U r_k)^T \Sigma^{-1}\, U r_k, \qquad k \in \{x_j, y_j\} \tag{5}$$
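Taking into account that $x_j$ and $y_j$ depend only on two rows of $U$, we define the 
sub-matrix $U_j$ according to the following expression: 

$$\begin{bmatrix} x_j \\ y_j \end{bmatrix} = \begin{bmatrix} \bar{x}_j \\ \bar{y}_j \end{bmatrix} + \begin{bmatrix} u_{2j-1,\circ} \\ u_{2j,\circ} \end{bmatrix} b = \begin{bmatrix} \bar{x}_j \\ \bar{y}_j \end{bmatrix} + U_j\, b, \qquad u_{j,\circ} = j\text{-th row of } U \tag{6}$$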



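In order to minimise the Mahalanobis-distance subject to the constraint of 
a separate x- or y-translation by one unit, we establish, as is customary for 
constrained optimisation, the Lagrange function $L$: 

$$L(r_k, l_k) = \sum_{e=1}^{N-1} \frac{r_{k,e}^2}{\lambda_e} - l_k^T \left[ U_j r_k - e_k \right], \qquad e_{x_j} = \begin{bmatrix} 1 \\ 0 \end{bmatrix}, \quad e_{y_j} = \begin{bmatrix} 0 \\ 1 \end{bmatrix} \tag{7}$$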



The vectors $l_{x_j}$ and $l_{y_j}$ contain, as usual, the required Lagrange multipliers. To 
find the minimum of $L(r_k, l_k)$, we calculate the derivatives with respect to all 
elements of $r_k$ and $l_k$ and set them equal to zero: 

$$\frac{\partial L(r_k, l_k)}{\partial r_k} = 2\Lambda^{-1} r_k - U_j^T l_k = 0, \qquad \frac{\partial L(r_k, l_k)}{\partial l_k} = -\left( U_j r_k - e_k \right) = 0, \qquad k \in \{x_j, y_j\} \tag{8}$$



If the basis vectors and the Lagrange multipliers are combined according to 
$R_j = [r_{x_j}\; r_{y_j}]$ and $L_j = [l_{x_j}\; l_{y_j}]$, Eq. (8) can be rewritten as two linear matrix 
equations: 

$$2\Lambda^{-1} R_j = U_j^T L_j \tag{9}$$

$$U_j R_j = I \tag{10}$$

The two basis vectors $r_{x_j}$ and $r_{y_j}$ resulting from simple algebraic operations 
(resolve (9) for $R_j$ and replace $R_j$ in (10) by the result, use the resulting equation 
to find $L_j = 2\,[U_j \Lambda U_j^T]^{-1}$, and substitute for $L_j$ in (9)) are then given by: 

$$R_j = [r_{x_j}\; r_{y_j}] = \Lambda U_j^T \left[ U_j \Lambda U_j^T \right]^{-1} \tag{11}$$

While $r_{x_j}$ describes the translation of $x_j$ by one unit with constant $y_j$ and 
minimal shape variation, $r_{y_j}$ alters $y_j$ correspondingly. The resulting effect caused 
by adding these shape-based basis vectors to the average model is illustrated 
in Fig. 2(b). The most probable shape $p$ given the displacement $[\Delta x_j, \Delta y_j]^T$ of 
control vertex $j$ is consequently determined by 

$$p = \bar{p} + U R_j \begin{bmatrix} \Delta x_j \\ \Delta y_j \end{bmatrix} \tag{12}$$
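
Equations (11) and (12) translate almost literally into code. The following is a sketch only: `point_basis` and `most_probable_shape` are names introduced here, `lam` is assumed to hold the eigenvalues as a 1-D array, and points are indexed from 0, so rows `2j` and `2j+1` of `U` stand in for rows 2j-1 and 2j of Eq. (6).

```python
import numpy as np

def point_basis(U, lam, j):
    """Basis vectors R_j = Lambda U_j^T (U_j Lambda U_j^T)^-1 (cf. Eq. 11).

    j is a 0-based point index, so rows 2j and 2j+1 of U play the role of
    rows 2j-1 and 2j in the 1-based notation of Eq. (6).
    """
    U_j = U[2 * j: 2 * j + 2, :]
    LUt = lam[:, None] * U_j.T                # Lambda U_j^T
    return LUt @ np.linalg.inv(U_j @ LUt)

def most_probable_shape(p_mean, U, lam, j, displacement):
    """Shape of Eq. (12): mean plus the weighted basis vectors of point j."""
    R_j = point_basis(U, lam, j)
    return p_mean + U @ (R_j @ np.asarray(displacement, dtype=float))
```

By construction `U_j @ R_j` is the 2 x 2 identity, so the chosen vertex moves exactly by the prescribed displacement while the rest of the outline follows with minimal Mahalanobis length.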



Another possibility to find the two basis vectors $R_j$ consists in exploiting the 
least-squares property of the Moore-Penrose pseudo-inverse. The basic idea in 
this context is to solve the highly under-determined linear system $U_j r_k = e_k$ 
(representing the prescribed constraints) by calculating the pseudo-inverse $U_j^{+}$. 
Since we are not looking for the normal least-squares solution but for the one with 
minimal Mahalanobis-distance, we have to introduce a weighting of the columns of 
$U_j$ in order to map the problem into normal Euclidean space, where the minimal 
solution is then given by the generalised inverse. As shown in Appendix A, this 
approach leads to the same basis vectors $R_j$ and validates Eq. (11), since there 
is only one unique element in the hyper-plane of all solutions that has minimum 
Mahalanobis-norm. 



4.2 Point-Wise Subtraction of Variation 



In the previous subsection we have seen how to choose the most probable shape 
given the position of one specific control vertex $j$. Before we can proceed 
to the next control point, we must ensure that subsequent shape modifications 
will not alter the previously adjusted vertex $j$. To do so, we must remove those 
components from the statistic that cause a displacement of this point. 
Unfortunately, we cannot apply a projection for this purpose, since the basis vectors 
$R_j$ are not orthogonal in the shape space. Therefore, we propose to subtract the 
variation coded by the point $j$ from each instance $i$, and to rebuild the 
statistic afterwards. For the first part of this operation, we must subtract the basis 
vectors $R_j$ weighted by the example-specific displacement $[\Delta x_j, \Delta y_j]^T$ from the 
parameter representation of each instance $i$: 

$$b_i^{j} = b_i - R_j \begin{bmatrix} \Delta x_j \\ \Delta y_j \end{bmatrix} = b_i - R_j U_j b_i = (I - R_j U_j)\, b_i, \qquad \forall i \in \{1, \ldots, N\} \tag{13}$$

Doing so for all instances, we obtain a new description of our population which 
is invariant with respect to the point $j$ (denoted by the superscript $\circ^{j}$). The 
variability in this point-normalised population is expected to be smaller compared 
to the original collection. In order to verify this assumption and to rebuild the 
statistic, we apply a PCA anew to the normalised set of instances 
$\{ b_i^{j} \mid i \in \{1, \ldots, N\} \}$. Note that the eigenspace shrinks by two dimensions since 
we removed two degrees of freedom. The resulting principal components, denoted 
by $U^{j}$, confirm the expected behaviour and also validate the removal of the 
variation of point $j$. The first four one-point invariant eigenmodes are illustrated 
in Fig. 3(b). 
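
The subtraction of Eq. (13) and the subsequent rebuilding of the statistic can be sketched as follows. The function names and the convention that the coordinate vectors $b_i$ are stored as columns of an array `B` are assumptions made for this illustration.

```python
import numpy as np

def subtract_point_variation(B, U_j, lam):
    """Apply Eq. (13) to all coordinate vectors (columns of B)."""
    LUt = lam[:, None] * U_j.T
    R_j = LUt @ np.linalg.inv(U_j @ LUt)          # basis vectors of Eq. (11)
    return (np.eye(len(lam)) - R_j @ U_j) @ B

def rebuild_statistic(B_new):
    """Eigenvalues (descending) and eigenvectors of the normalised coordinates."""
    cov = B_new @ B_new.T / (B_new.shape[1] - 1)
    lam_new, V = np.linalg.eigh(cov)              # ascending order
    order = np.argsort(lam_new)[::-1]
    return lam_new[order], V[:, order]
```

After the subtraction, `U_j @ B_new` vanishes for every instance, and the rebuilt statistic has two fewer non-zero eigenvalues, in line with the two removed degrees of freedom.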



4.3 Point Selection Strategy 

The point-wise elimination of variability presented above can subsequently be 
repeated for several points, until the remaining variability is small enough with 
respect to the working range of the subsequent segmentation algorithm. In order 
to achieve optimal results and to find the most compact control polygon, we 
should now explore the strategy for the selection of control points. Since we 
aim to choose those vertices that carry as much shape information as possible, 
we should select the points according to their “reduction potential”. A control 
vertex holds a large reduction potential if the remaining variability after its 
elimination is small. 

Fig. 3. (a) The first four eigenmodes of 70 corpus callosum instances. The various 
shapes are obtained by evaluating $\bar{p} + \omega\sqrt{\lambda_k}\, u_k$ with $\omega \in \{-2, \ldots, 2\}$ and 
$k \in \{1, \ldots, 4\}$. (b) The first four one-point invariant eigenmodes after subtracting 
the first principal landmark. The various shapes are obtained by evaluating 
$\bar{p} + \omega\sqrt{\lambda_k^{s_1}}\, u_k^{s_1}$ with $\omega \in \{-2, \ldots, 2\}$ and $k \in \{1, \ldots, 4\}$. 

To make the following formalism as precise as possible, we introduce some 
additional definitions at this point: Firstly, we will subsequently refer to the 
point being removed from the statistic as the principal landmark. Secondly, 
let the sequence $s_k = \{j_1, \ldots, j_k\}$ denote the set of point-indices of those $k$ 
principal landmarks that have been removed from the statistic in the given 
order. And last but not least, the superscript $\circ^{s_k}$ is used for the value of $\circ$ if 
the principal landmarks $s_k$ have been removed. 

Using this formalism, the reduction potential $P$ of vertex $j_k$, being a candidate 
to serve as the $k$-th principal landmark, can be defined as follows: 

$$P(j_k) = -\sum_{e=1}^{N-1-2(k-1)} \lambda_e^{s_k}, \qquad s_k = \{j_1, \ldots, j_k\} \tag{14}$$

Figure 4(a) shows the reduction potential for all the points of the original model. 
In order to remove as much variation as possible, we consequently choose that 
point as the first principal landmark that holds the largest reduction potential: 
$j_1 = \arg\max_j\, [P(j)]$. The selected vertex and the resulting point-invariant statistic 
after its elimination have already been shown in Fig. 3(b). 

If we apply this selection and elimination step twice again, we end up with 
the second and third principal landmark. The corresponding eigenmodes and 
the selected points are depicted in Fig. 5. The decreasing deviations from the 
mean indicate that the variation within the population is progressively reduced 
by this operation. The observation of the overall variance subject to progressive 
point removal (see Fig. 4(c)) verifies this hypothesis and shows that the 
variability decreases surprisingly fast in the beginning. Later on, after three vertices 
have been processed, the decline levels out and the benefit of each additional 
principal landmark becomes fairly small. This finding suggests that the main 
shape characteristics of a corpus callosum can be captured by only three or four 
principal landmarks. 

Fig. 4. (a) & (b) Reduction potentials for the selection of (a) the first and (b) the 
second principal landmark. For each point $j$ on the abscissa, the reduction potential 
$P(j)$ is displayed. Note that the first principal landmark $j_1 = 196$ has minimal reduction 
potential in (b), because subtracting the same point twice has no effect at all. (c) 
The overall variance $\mathrm{tr}(\Sigma^{s_k})$ of the population depending on the number of 
subtracted principal landmarks. 

Fig. 5. Remaining variability after vertex elimination of (a) two and (b) three principal 
landmarks (panel (b): three-point invariant eigenmodes). 
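
The greedy selection by reduction potential can be sketched as below. Two simplifications are assumptions of this sketch rather than statements of the text: instead of re-running a PCA after every elimination, the covariance `C` of the coordinates is carried forward (its trace equals the sum of the rebuilt eigenvalues), and points that were already selected are skipped explicitly. The names `remove_point` and `select_landmarks` are hypothetical.

```python
import numpy as np

def remove_point(C, U, j):
    """Covariance of the coordinates after point j (0-based) has been fixed."""
    U_j = U[2 * j: 2 * j + 2, :]
    R_j = C @ U_j.T @ np.linalg.inv(U_j @ C @ U_j.T)
    T = np.eye(C.shape[0]) - R_j @ U_j
    return T @ C @ T.T

def select_landmarks(U, lam, n_landmarks):
    """Greedy choice of principal landmarks by reduction potential (cf. Eq. 14)."""
    C = np.diag(lam)
    n_points = U.shape[0] // 2
    chosen, remaining = [], []
    for _ in range(n_landmarks):
        best = None
        for j in range(n_points):
            if j in chosen:                    # removing a point twice has no effect
                continue
            C_j = remove_point(C, U, j)
            potential = -np.trace(C_j)         # P(j) = -(remaining variance)
            if best is None or potential > best[0]:
                best = (potential, j, C_j)
        chosen.append(best[1])
        C = best[2]
        remaining.append(-best[0])
    return chosen, remaining
```

The second return value lists the overall variance left after each elimination, which corresponds to the curve described for Fig. 4(c).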



5 Initial Shapes for Segmentation 

The progressive application of the point selection and removal process now 
enables the construction of the most compact principal control polygon, consisting 
of the first few principal landmarks. Analogous to traditional parametric curve 
representations, each control point has two associated principal basis functions 
$(U R_j)$ that are globally supported. The final outline $p_l$ based on a principal 
control polygon with $l$ vertices is then given by the inverse operation of the 
construction, that is, by the combination of the mean shape $\bar{p}$ and all the weighted 
principal basis functions: 

$$p_l = \bar{p} + \sum_{k=1}^{l} U^{s_{k-1}} R_{j_k}^{s_{k-1}} \begin{bmatrix} \Delta x_{j_k} \\ \Delta y_{j_k} \end{bmatrix}^{s_{k-1}} \tag{15}$$

Note that the weights $[\Delta x_{j_k}, \Delta y_{j_k}]^T$ for the basis vectors $R_{j_k}^{s_{k-1}}$ depend 
on the shape defined by the previous principal landmarks $s_{k-1}$. Therefore, if 
any control point $j_k$ is modified and the less important landmarks shall remain 
in their position, the weights $\{[\Delta x_{j_{k+1}}, \Delta y_{j_{k+1}}]^T, \ldots, [\Delta x_{j_l}, \Delta y_{j_l}]^T\}$ must be 
recalculated in the correct order of vertex removal. To emphasise the hierarchical 
structure of our formalism and to simplify the algorithmic implementation, we 
recommend using the following recursive definition instead of equation (15): 

$$p_0 = \bar{p}, \qquad p_k = p_{k-1} + U^{s_{k-1}} R_{j_k}^{s_{k-1}} \begin{bmatrix} \Delta x_{j_k} \\ \Delta y_{j_k} \end{bmatrix}^{s_{k-1}} \tag{16}$$
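
The recursion of Eq. (16) can be sketched as a simple loop. As an assumption of this sketch, the point-invariant statistics are not rebuilt by a new PCA at every step; the covariance `C` of the coordinates is updated instead, which yields the same basis functions expressed in the original eigenspace. The function name and the argument layout (`landmarks` as 0-based point indices in order of removal, `targets` as their prescribed positions) are hypothetical.

```python
import numpy as np

def initial_outline(p_mean, U, lam, landmarks, targets):
    """Recursive construction p_k = p_{k-1} + U R [dx, dy]^T (cf. Eq. 16)."""
    p = np.array(p_mean, dtype=float)
    C = np.diag(lam)
    for j, target in zip(landmarks, targets):
        U_j = U[2 * j: 2 * j + 2, :]
        R_j = C @ U_j.T @ np.linalg.inv(U_j @ C @ U_j.T)
        # Displacement of landmark j relative to the outline built so far
        delta = np.asarray(target, dtype=float) - p[2 * j: 2 * j + 2]
        p = p + U @ (R_j @ delta)
        # Remove the variation of point j so that later steps leave it fixed
        T = np.eye(len(lam)) - R_j @ U_j
        C = T @ C @ T.T
    return p
```

Because each step removes the variation of the landmark it has just placed, all previously adjusted landmarks stay at their targets when later ones are moved.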



With this shape-based curve representation $p$, the last piece has fallen into 
place. By utilising a minimal principal control polygon with associated basis 
functions, we are now able to fulfil all our original objectives: The initialisation 
of a new shape instance results in the simple adjustment of a small number of 
points, taking into account all our prior knowledge of the shape. 

In order to validate the quality of the proposed method, we will subsequently 
show some results of cross-validation experiments that have been performed with 
each shape instance in our database. It goes without saying that the test instance 
has always been removed from the statistic. The initialisations to be presented 
have been generated by moving the principal landmarks into the positions of the 
corresponding points on the outline of the respective test object. A selection of 
the results of these experiments is illustrated in Fig. 6 and can be summarised 
as follows: The initial average model in Fig. 6(a) converges efficiently towards 
an approximation of the correct shape whilst the control vertices are adjusted. 
In most of our examples, only three or four principal landmarks are necessary to 
provide a reasonably good initialisation. The consideration of more than five or 
six points does not significantly improve the quality of the initial shape. In some 
cases, the initialisation even deteriorates slightly if too many control vertices are 
employed. This behaviour may also indicate some deficiencies of the underlying 
correspondence function. To show a representative cross-section of the achieved 
results, Fig. 6(b) displays four truly randomly chosen experiments, where four 
principal landmarks have been adjusted. 



6 Interactive and Automatic Initialisation 

The major question remaining to be answered is whether the proposed 
framework proves its worth in practical application as well. Although we have not yet 
gained any experience in everyday clinical application, our experimental results 
are fairly convincing. Tests have been performed for both interactive and fully 
automatic initialisation. 

Fig. 6. (a) Generation of an initial outline for segmentation; shape instance in black and 
fitted initialisations in gray with an increasing number of fitted principal landmarks. 
(b) Initial shapes with four adjusted principal landmarks for the segmentation of four 
randomly chosen instances. 

The interactive approach simply uses the underlying shape basis as a highly 
specialised curve representation. The required adjustments of the principal 
landmarks must be provided by a human operator. Since the recalculation of 
the outline can be done at interactive speed, the instant feedback supports the 
operator in finding an appropriate initialisation within a few seconds. In most 
of the cases, three principal landmarks are sufficient to define a coarse 
initialisation. At most three additional control vertices can then be used to refine the 
characteristic details of the shape. Figure 7(a) shows one possible initialisation 
based on six manually adjusted principal landmarks. 

By exploiting the statistical prior knowledge of the shape once again, we 
can even eliminate the remaining interaction: For each principal landmark $j_k$, 
we calculate the covariance matrix in order to determine its positional 
variability. On the assumption that a landmark is Gaussian distributed, we 
can then compute a confidence ellipse that contains the corresponding control 
point with probability $\chi_2^2(c) = P(|\omega| < c)$ (see e.g. [?]). The new auxiliary 
variable $\omega$ is, as usual in this context, a standardised random vector with 
normal distribution: $\omega_i \sim \mathcal{N}(0, 1) \;\wedge\; \omega \sim \mathcal{N}(0, I)$. Since it is well known that 
$P(|\omega| < 3) \approx 99\%$, we can construct the main axes $a_{j_k}$ and $b_{j_k}$ of the 
confidence ellipse that contains the principal landmark with a probability of 99% 
by the following linear transformation of $\omega$: 

(17) 

Figure 7(b) shows these confidence ellipses for all considered control vertices. As 
expected, the length of the axes $a_{j_k}$ and $b_{j_k}$ declines with increasing $k$, according 
to the smaller variances in the underlying statistics. 
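
A standard way to obtain such confidence axes from a 2 x 2 positional covariance is sketched below. It is not claimed to be identical to Eq. (17); the function names are hypothetical, and the construction simply scales the eigenvectors of the covariance by $c$ times the square root of the eigenvalues, using the closed-form coverage $1 - e^{-c^2/2}$ of a two-dimensional standard normal vector.

```python
import numpy as np

def coverage_probability(c):
    """P(|w| < c) for a standard normal vector w in two dimensions."""
    return 1.0 - np.exp(-0.5 * c ** 2)

def confidence_axes(cov, c=3.0):
    """Main axes (as 2-vectors) of the ellipse {x : x^T cov^-1 x <= c^2}."""
    w, V = np.linalg.eigh(np.asarray(cov, dtype=float))   # ascending order
    a = c * np.sqrt(w[1]) * V[:, 1]                        # major axis
    b = c * np.sqrt(w[0]) * V[:, 0]                        # minor axis
    return a, b
```

For $c = 3$ the coverage evaluates to roughly 0.989, in agreement with the 99% quoted in the text.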
Fig. 7. (a) Interactive initialisation by manual adjustment of six principal landmarks. 
(b) Initial average model with the confidence ellipses of the control vertices. (c) & (d) 
Automatic initialisation by the sequential optimisation of the matching function $G_k$ 
for (c) three and (d) six principal landmarks. 



For an automatic initialisation, we can subsequently use these confidence 
intervals as the region of interest with respect to an optimisation of the fit. 
The goal function of such an optimisation should measure the correspondence 
between the shape to be optimised and the actual image data $I$. In order 
to simplify and accelerate the optimisation process, we propose to fit only one 
principal landmark at a time, analogous to manual initialisation. By employing 
a very popular matching function based on the image gradient $\nabla I$, we end up 
with the following goal function $G_k$: 

$$G_k(\Delta x_{j_k}, \Delta y_{j_k}) = \sum_{e=1}^{M} \left\| \nabla I\!\left(p_k^{(e)}\right) \right\|, \qquad I: \text{image data} \tag{18}$$

Here $p_k^{(e)}$ stands for the $e$-th vertex of the outline $p_k$ obtained from Eq. (16) 
with the displacement $[\Delta x_{j_k}, \Delta y_{j_k}]^T$. 

Note that $G_k$ depends on the results of the previously optimised principal 
landmarks $s_{k-1}$, since the centre of the confidence ellipse $k$ is given by $p_{k-1}$. A 
closer inspection of the goal functions $G_k$ within the confidence ellipse $k$ shows 
that the most important goal functions $G_1$, $G_2$, and $G_3$ exhibit several local 
minima and maxima. But apart from this minor difficulty, their overall behaviour 
is fairly smooth and regular. However, due to the hierarchical dependencies, 
it is essential to reliably locate the global maximum. Therefore, we propose 
the following simple optimisation scheme: In a first step, we sample the goal 
function within the bounding box of the confidence ellipse on a coarse grid, in 
order to find the local neighbourhood of the global optimum. Having done so, 
we apply the Newton-Raphson method to find the proper optimum. Since the 
computation of a Newton-Raphson iteration includes the calculation of first and 
second derivatives, we recommend fitting a bivariate Taylor polynomial of fourth 
degree around the estimated optimum, instead of relying on discrete derivatives. 
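
The two-stage scheme (coarse sampling, then Newton-Raphson on a fitted polynomial) can be sketched as follows. This is a simplified illustration under stated assumptions: a second-degree polynomial is fitted by least squares instead of the fourth-degree polynomial recommended above, the goal function is passed in as a hypothetical callable `goal(x, y)`, and the search region is the axis-aligned box given by `centre` and `half_extent`.

```python
import numpy as np

def coarse_to_fine_maximum(goal, centre, half_extent, grid=9, newton_steps=3):
    """Coarse grid search followed by Newton steps on a local polynomial fit."""
    cx, cy = centre
    xs = np.linspace(cx - half_extent[0], cx + half_extent[0], grid)
    ys = np.linspace(cy - half_extent[1], cy + half_extent[1], grid)
    values = np.array([[goal(x, y) for x in xs] for y in ys])
    iy, ix = np.unravel_index(np.argmax(values), values.shape)
    z = np.array([xs[ix], ys[iy]])            # best node of the coarse grid
    h = 0.5 * (xs[1] - xs[0])                 # half a grid cell
    offsets = np.linspace(-h, h, 5)
    dx, dy = (a.ravel() for a in np.meshgrid(offsets, offsets))
    # Design matrix of a quadratic in the local offsets: 1, x, y, x^2, xy, y^2
    A = np.column_stack([np.ones_like(dx), dx, dy, dx ** 2, dx * dy, dy ** 2])
    for _ in range(newton_steps):
        g = np.array([goal(z[0] + u, z[1] + v) for u, v in zip(dx, dy)])
        coef = np.linalg.lstsq(A, g, rcond=None)[0]
        grad = coef[1:3]                      # gradient of the fit at z
        hess = np.array([[2 * coef[3], coef[4]], [coef[4], 2 * coef[5]]])
        z = z - np.linalg.solve(hess, grad)   # Newton-Raphson update
    return z
```

Fitting a smooth polynomial to nearby samples replaces finite-difference derivatives, which is the design choice the paragraph above argues for; a higher polynomial degree would need a larger sampling patch.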
This optimisation scheme has proven to be robust, and it reliably finds the 
global optimum with high sub-pixel accuracy. The time taken to optimise one 
principal landmark amounts to about one second on an SGI O2. Figure 7(d) 
shows the result of the initialisation if the optimisation is sequentially applied 
to six principal landmarks. With the exception of the splenium of the corpus 
callosum, the resulting outline is very close to the optimum. A comparison between 
the final outline and the manual initialisation in Fig. 7(a) shows two differences: 
On the one hand, the automated method obviously detects the border of the 
shape with higher precision. On the other hand, manual initialisation seems to 
be superior with respect to an overall fit to the image data. Although we could 
speculate that the better estimate induces a higher distortion of the 
correspondence function, we suspect that the superior performance has another reason: 
The manual approach simply finds a better solution regarding the problem of 
optimising the position of all principal landmarks at once. The automatic 
optimisation of this problem is much more difficult due to the dependencies of the 
goal functions $G_k$ and has not yet been investigated. 



7 Conclusion and Future Research 

In search of a stable initialisation oracle that is based on a small number of points, 
we presented a new way to make a statistical shape description point-wise inva- 
riant. The inverse of the resulting operation generates initial configurations for 
subsequent segmentation by choosing the most probable shape given the esti- 
mated control polygon. The whole framework has been evaluated by means of 
a shape population consisting of 71 corpus callosum instances. To demonstrate 
its practical benefit, we implemented both an interactive and a fully automatic 
initialisation method. The achieved results are satisfying and validate its suitabi- 
lity for our initialisation purposes. Furthermore, we gained a deeper insight into 
the nature of the shape under investigation by finding the most compact shape 
description given by the principal control polygon with associated principal basis 
functions. 

Additional work has to be done in order to evaluate and improve the prac- 
tical application of the proposed shape analysis. In the context of interactive 
initialisation, we must explore the influence of the point selection strategy on 
the user’s ability to locate the prescribed vertices in the image. Since we choose 
the principal landmarks purely on the basis of a statistical measure, problems 
may arise in locating the correct position of the points in the image to be seg- 
mented. Hence, another point selection strategy could be based on the analysis 
of local shape and image characteristics. Control vertices with salient local curve 
features or locations with stable image characteristics could serve as landmarks 
well suited for automatic or interactive localisation. Such point selection oracles 
should be combined with our statistical selection strategy. Moreover, the 
automatic initialisation should be improved by optimising all the control vertices at 
once. 

Last but not least, model-based initialisations for surfaces should be provided 
as well, in order to overcome the limitations imposed by the two-dimensional 
segmentation approach if three-dimensional data sets are available. And if we 
broaden the horizons beyond the borders of computer vision, we surmise that 
our framework could be of great value for the interactive animation of various 
natural objects. 
A Derivation of the Basis Vectors $R_j$ by Means of the 
Pseudo-Inverse 

Another approach to find the two basis vectors $r_{x_j}$ and $r_{y_j}$ causing unit 
translation in x- and y-direction with minimal Mahalanobis-distance $D_m$ involves the 
calculation of the generalised Moore-Penrose pseudo-inverse [?]. To derive a 
solution with this concept, we use only the prescribed constraints as a starting 
point: 

$$U_j R_j = I \tag{19}$$

Since $U_j$ is a $2 \times (N-1)$ matrix, the linear system of equations in (19) is highly 
under-determined. Such a system either has no solution or there will be an 
$(N-3)$-dimensional family of solutions. In the second case, one can show that there is 
a unique element in the hyper-plane of all solutions which has minimum 2-norm 
[?]. It is well known that this least-squares solution can be found by calculating 
the generalised Moore-Penrose pseudo-inverse $U_j^{+}$. The resulting vectors $R_j$ with 
minimal Euclidean norm are then given by 

$$R_j = U_j^{+} I = U_j^{+}. \tag{20}$$

Unfortunately, we are not looking for the solution with minimal 2-norm but 
for the one with minimal Mahalanobis-distance $D_m$. For this reason, we 
introduce the Mahalanobis-norm $\|\circ\|_m$ that can be expressed in terms of the 
traditional Euclidean norm: 

$$\|x\|_m = \left\| \sqrt{\Lambda^{-1}}\, x \right\|_2 \tag{21}$$

If we are able to calculate the least-squares solution with respect to this 
Mahalanobis-norm, we have automatically found the solution with minimal 
Mahalanobis-distance, since 

$$\|x\|_m^2 = \left\| \sqrt{\Lambda^{-1}}\, x \right\|_2^2 = \left( \sqrt{\Lambda^{-1}}\, x \right)^T \left( \sqrt{\Lambda^{-1}}\, x \right) = x^T \Lambda^{-1} x = D_m \tag{22}$$

By exploiting relation (21), we can map the minimisation of the Mahalanobis-norm 
to the normal least-squares problem with respect to the Euclidean norm. 
The required transformation results in a weighting of the columns of $U_j$ by the 
square root of the corresponding eigenvalues $\Lambda$: 

$$\min_{R_j} \|R_j\|_m : U_j R_j = I \quad \Longleftrightarrow \quad \min_{\tilde{R}_j} \|\tilde{R}_j\|_2 : \left( U_j \sqrt{\Lambda} \right) \tilde{R}_j = I, \qquad \tilde{R}_j = \sqrt{\Lambda^{-1}}\, R_j \tag{23}$$

$$R_{j,\min} = \sqrt{\Lambda}\, \tilde{R}_{j,\min} = \sqrt{\Lambda} \left( U_j \sqrt{\Lambda} \right)^{+} \tag{24}$$

As illustrated in Eq. (24), the minimal m-norm vectors $R_{j,\min}$ are then given by 
the scaled version of the least-squares solution $\tilde{R}_{j,\min}$ that is uniquely determined 
by the pseudo-inverse of $(U_j \sqrt{\Lambda})$. By exploiting subsequently the relation 
$A^{+} = A^T (A A^T)^{-1}$ that holds for $m \times n$ matrices $A$ with $m < n$ and $\mathrm{rank}(A) = m$ 
(see [?]), we end up with a fairly familiar result: 

$$R_j = \sqrt{\Lambda} \left( U_j \sqrt{\Lambda} \right)^{+} = \sqrt{\Lambda} \left( \sqrt{\Lambda}\, U_j^T \left[ U_j \Lambda U_j^T \right]^{-1} \right) = \Lambda U_j^T \left[ U_j \Lambda U_j^T \right]^{-1} \tag{25}$$
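
The agreement between the pseudo-inverse route of Eq. (24) and the closed form of Eqs. (11) and (25) is easy to check numerically. The sketch below uses hypothetical function names and assumes that `lam` holds strictly positive eigenvalues and that `U_j` has full row rank.

```python
import numpy as np

def basis_closed_form(U_j, lam):
    """R_j = Lambda U_j^T (U_j Lambda U_j^T)^-1, Eqs. (11) and (25)."""
    LUt = lam[:, None] * U_j.T
    return LUt @ np.linalg.inv(U_j @ LUt)

def basis_pseudo_inverse(U_j, lam):
    """R_j = sqrt(Lambda) (U_j sqrt(Lambda))^+, Eq. (24)."""
    s = np.sqrt(lam)
    # Multiplying by s[None, :] scales the columns of U_j, as in Eq. (23)
    return s[:, None] * np.linalg.pinv(U_j * s[None, :])
```

Both functions return the same matrix up to rounding, and in both cases `U_j @ R_j` is the identity.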
Abstract. A Bayesian approach to object localisation is feasible given suitable 
likelihood models for image observations. Such a likelihood involves statistical 
modelling, and learning, both of the object foreground and of the scene 
background. Statistical background models are already quite well understood. Here we 
propose a “conditioned likelihood” model for the foreground, conditioned on 
variations both in object appearance and illumination. Its effectiveness in localising 
a variety of objects is demonstrated. 



1 Introduction 

Following “pattern theory” [?], we regard an image of an object as a function 
$I(x)$, $x \in D \subset \mathbb{R}^2$, generated from a template image $\bar{I}(x)$ over a support $S$ that 
has undergone certain distortions. Much of the distortion is accounted for as a warp of 
the template $\bar{I}(x)$ into the image by a warp mapping $T_X$: 

$$I(x) = \bar{I}(T_X(x)), \qquad x \in S, \tag{1}$$

where $T_X$ is parameterised by $X \in \mathcal{X}$ over some configuration space $\mathcal{X}$, for instance 
planar affine warps. We adopt the convention that $X = 0$ is the template configuration, 
so that $T_X$ is the identity map when $X = 0$. 

Using the warp framework, “analysis by synthesis” can be applied to generate the 
posterior distribution for $X$. Given a prior distribution $p_0(X)$ for the configuration $X$, and 
an observation likelihood $L(X) = p(Z|X)$ where $Z = Z(I)$ is some finite-dimensional 
representation of the image $I$, the posterior density for $X$ is given by 

$$p(X|Z) \propto p_0(X)\, p(Z|X). \tag{2}$$

This can be done very effectively by factored sampling, which produces a weighted 
“particle-set” $\{(s^{(1)}, \pi_1), \ldots, (s^{(N)}, \pi_N)\}$ of size $N$ that approximates the posterior [?]. 
This approximation of the distribution allows fusion of inference about $X$ from different 
sensors, over time and across scales. It also allows a structured way of incorporating 
prior knowledge into the algorithm. 
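
Factored sampling as described above can be sketched in a few lines: particles are drawn from the prior and weighted in proportion to the observation likelihood. The function name and the two callables `sample_prior` and `likelihood` are hypothetical stand-ins chosen for this illustration; a scalar configuration is assumed for simplicity.

```python
import numpy as np

def factored_sampling(sample_prior, likelihood, n_particles, rng):
    """Weighted particle set {(s_n, pi_n)} approximating p(X|Z) (cf. Eq. 2)."""
    particles = np.array([sample_prior(rng) for _ in range(n_particles)])
    weights = np.array([likelihood(s) for s in particles], dtype=float)
    weights /= weights.sum()                  # pi_n proportional to p(Z | X = s_n)
    return particles, weights
```

Posterior expectations are then approximated by weighted sums over the particle set, for example `np.sum(weights * particles)` for the posterior mean.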

Much of the challenge with the pattern theory approach is in constructing a suitable 
matching score. Examples of non-Bayesian approaches include correlation scores 
[?] and mutual information [?]. But factored sampling calls for a Bayesian 
approach in which both the foreground and background image statistics are modelled [?]. 
In particular, modelling a likelihood $p(Z|X)$ in terms of the foreground/background 
statistics of receptive field outputs is employed in Bayesian Correlation [?]. Although 
background statistics for Bayesian Correlation, and their independence properties, are 
quite well understood [?], foreground statistics are more complex. 



Foreground statistics should be characterised by the response of a receptive field 
conditioned on its location relative to the object and on the object’s pose. This can be 
achieved by performing template subtraction. This increases the specificity and selectivity 
between background and foreground over the method of ad hoc foreground “partitioning” 
implemented in [?]. The weakness of the latter approach is demonstrated in figure 1. 
Even when receptive fields are mutually independent over the background, independence 
need not necessarily hold over the foreground. It was hoped that the new foreground 
measurements would also be decorrelated and/or independent. However, it turns out 
that the statistical dependencies between measurements are not greatly affected by the 
template subtraction. This paper proposes a more acutely tuned foreground likelihood, 
conditioned explicitly on variability of pose and illumination, that pays greater respect 
to the deterministic properties of the object’s geometric layout. 

Fig. 1. Simple foreground partitioning gives poor selectivity. A decoy object produces an 
alternative likelihood peak of sufficient strength that the mean configuration (black contour) is 
substantially displaced from the true location of the head. (White contours represent the posterior 
distribution; wider contours indicate higher likelihood for the face object.) 



2 Modelling Image Observations 



In the framework presented here, image intensities are observed via a bank of filters,
isotropic ones in the examples shown here, though steerable, oriented filters would
also be eminently suitable. The likelihood of such observations depends both on
foreground and background statistics and this approach is reviewed below, before looking
more carefully at foreground models in the following section.

2.1 Filter Bank

The observation $Z = Z(I)$ is to be a fixed, finite dimensional representation of the
image $I$, consisting of a vector $Z = (z_1, \ldots, z_K)$ whose components

$$z_k = \int_{S_k} W_{\mathbf{x}_k}(\mathbf{x})\, I(\mathbf{x})\, d\mathbf{x} \tag{3}$$

are an inner product of the image with a filter function $W_{\mathbf{x}_k}$, over a finite support $S_k$.
In earlier work it was argued that a suitable choice of filter function is a Laplacian of Gaussian
$W_{\mathbf{x}}$, centred at $\mathbf{x}$:

$$W_{\mathbf{x}}(\mathbf{x}') = \nabla^2 G_\sigma(\mathbf{x}' - \mathbf{x})$$

with hexagonally tessellated, overlapping supports as in figure 2. The scale parameter of




Fig. 2. Tessellation of filter supports. Filters are arranged in a hexagonal tessellation, as shown, 
with substantial overlap (support radius r = 40 pixels illustrated). 



the Gaussian is $\sigma$ and it is adequate to truncate the Gaussian to a finite support of radius
$r = 3\sigma$. The tessellation scheme was arrived at by requiring the densest packing
of supports while maintaining statistical decorrelation between filters over background
scene texture. In practice, at that separation, filter responses are not only decorrelated
but also, to a good approximation, independent over the background.
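
The filter bank described in this section can be sketched in a few lines. The following is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the use of NumPy, and the nearest-neighbour spacing passed to `hex_centres` are all hypothetical choices (the text suggests a separation of about $r$).

```python
import numpy as np

def log_kernel(sigma):
    """Laplacian-of-Gaussian kernel truncated to a disc of radius r = 3*sigma."""
    r = int(np.ceil(3 * sigma))
    y, x = np.mgrid[-r:r + 1, -r:r + 1]
    d2 = x**2 + y**2
    # Analytic Laplacian of an isotropic 2D Gaussian G_sigma.
    g = np.exp(-d2 / (2 * sigma**2)) / (2 * np.pi * sigma**2)
    k = (d2 - 2 * sigma**2) / sigma**4 * g
    k[d2 > (3 * sigma)**2] = 0.0  # circular support of radius 3*sigma
    return k

def hex_centres(height, width, spacing):
    """Centres (y, x) of a hexagonal tessellation with the given spacing."""
    centres = []
    row_step = spacing * np.sqrt(3) / 2
    row, y = 0, 0.0
    while y < height:
        x = 0.0 if row % 2 == 0 else spacing / 2  # offset alternate rows
        while x < width:
            centres.append((y, x))
            x += spacing
        y += row_step
        row += 1
    return np.array(centres)

def filter_bank_response(image, centres, kernel):
    """Inner product z_k = <W_{x_k}, I> at each centre whose support fits in the image."""
    r = kernel.shape[0] // 2
    z = []
    for cy, cx in np.round(centres).astype(int):
        if r <= cy < image.shape[0] - r and r <= cx < image.shape[1] - r:
            patch = image[cy - r:cy + r + 1, cx - r:cx + r + 1]
            z.append(float(np.sum(kernel * patch)))
    return np.array(z)
```

Centres whose support would leave the image are simply skipped here; how boundary supports are treated in the original system is not stated in the text.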



2.2 Probabilistic Modelling of Observations 

The observation (i.e. output value) $z$ from an individual filter is generated by integration
over a support-set $S$, such as the circular one in figure 3, which is generally composed
of both a background component $B(X)$ and a foreground component $F(X)$:

$$z|X = \int_{B(X)} W(\mathbf{x})\, I(\mathbf{x})\, d\mathbf{x} + \int_{F(X)} W(\mathbf{x})\, I(\mathbf{x})\, d\mathbf{x}. \tag{4}$$

' V ' 

MAIN NOISE SOURCE 

Densities $p^B(z|\rho)$ and $p^F(z|\rho)$, $0 < \rho < 1$, for the background and foreground
components of mixed supports must be learned. Then, a particular object hypothesis $X$ is

Fig. 3. Foreground and background filter components. A circular support set $S$ is illustrated
here, split into subsets $F(X)$ from the foreground and $B(X)$ from the background. Assuming
that the object’s bounding contour is sufficiently smooth, the boundary between foreground and
background can be approximated as a straight line. The support therefore divides into segments
with offsets $2r\rho$ and $2r(1 - \rho)$ for background and foreground respectively.



evaluated as a global likelihood score $p(Z|X)$, based on components $z_1, \ldots, z_K$ which
need to have either a known mutual dependence or, simpler still, be statistically
independent. Then the observation likelihood can be constructed as a product

$$p(Z|X) = \prod_{k=1}^{K} p(z_k|X) \tag{5}$$

containing terms $p(z_k|X)$ in which the density depends, to varying degrees
according to the value of $X$, on each of the learned densities $p^F$ and $p^B$ for the foreground
and the background model. This places the requirement on the filter functions $W_{\mathbf{x}_k}$ that
they should generate such mutually independent $z_k$. As mentioned in section 2.1 this
is known to be true for $z_k$ over the background. Here we aim to establish independence
also over the foreground.
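
In practice a product such as (5) is usually evaluated as a sum of logarithms to avoid numerical underflow when $K$ is large. The sketch below illustrates this; the helper names and the particular Laplace and Gaussian densities are placeholders chosen for illustration, not the learned densities of the paper.

```python
import math

def log_likelihood(z, densities):
    """log p(Z|X) = sum_k log p(z_k|X), assuming independent filter outputs."""
    return sum(math.log(p(zk)) for zk, p in zip(z, densities))

def laplace_pdf(scale):
    """Zero-mean Laplace density (placeholder for a background-like response)."""
    return lambda z: math.exp(-abs(z) / scale) / (2 * scale)

def gaussian_pdf(std):
    """Zero-mean Gaussian density (placeholder for a foreground-like response)."""
    return lambda z: math.exp(-0.5 * (z / std) ** 2) / (std * math.sqrt(2 * math.pi))

# A hypothesis X decides which density applies to which filter output.
z = [0.1, -2.0, 0.3]
densities = [gaussian_pdf(1.0), laplace_pdf(2.0), gaussian_pdf(1.0)]
score = log_likelihood(z, densities)
```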



3 Modelling the Foreground Likelihood 

The modelling of background components is straightforward, simply inferring a
density for responses $z$ from a training set of filter outputs $z_n$, calculated from supports
$S_n$ dropped at random over an image. Then $p^B(z|\rho)$ can be learned for some finite
set of $\rho$-values, and interpolated for the $\rho$-continuum. A similar approach can be used
for the foreground case $p^F$, but with some important additional complexities.

3.1 Spatial Pooling 

The distribution $p^B(z|\rho)$ is learned from segments dropped down at random, anywhere
on the background. Over the foreground, and in the case that $\rho = 0$, $p^F(z|\rho)$ is similarly
learned from a circular support, dropped now at any location wholly inside the training
object. However, whenever $\rho > 0$, the support $F(X)$ must touch the object outline;
therefore $p^F(z|\rho)$ has to be learned entirely from segments touching the outline. Thus,
for $\rho = 0$, statistics are pooled over the whole of the object interior (“spatial pooling”),
whereas for $\rho > 0$ statistics pooling is restricted to occur over narrow bands, of width
$2r(1 - \rho)$, running around the inside of the template contour.

Spatial pooling dilutes information contained in the gross spatial arrangement of
the grey-level pattern. Sometimes this provides adequate selectivity for the observation
likelihood, particularly when the object outline is distinctive, such as the outline of a
hand. The outline of a face, though, is less distinctive. In the extreme
case of a circular face, and using isotropic filters, rotating the face would not produce
any change in the pooled response statistics. In that case, the observation likelihood
would carry no information about (2D) orientation. One approach to this problem is
to include some anisotropic filters in the filter bank, which would certainly address the
rotational indeterminacy. Another approach to enhancing selectivity is to subdivide
the interior $T$ of the object as $T = T_0 \cup \ldots \cup T_{N_P}$, and construct individual distributions
$p^F_i(z|\rho = 0)$ for each subregion $T_i$. However, the choice of the number and shape of
subregions is somewhat arbitrary. It would be much more satisfying to find a way of
increasing selectivity that is tailored specifically to foreground structure, rather than
imposing an arbitrary subdivision, and that is what we seek to do in this paper.



3.2 Warp Pooling 

In principle the foreground density $p^F$ depends on the full warp $T_X$. This means that
$p^F(z|\rho)$ must be learned not simply from one image, but from a training set of images
containing a succession of typical transformations of the object, and this is reasonable
enough. In principle, the learned $p^F$ should be parameterised not merely by $\rho(X)$, as was
the case for the background, but by the full, multi-dimensional configuration $X$ itself,
and that is not computationally feasible. One approach to this problem is that, if these
variations cannot be modelled parametrically, they can nonetheless be pooled into the
general variability represented by $p^F(z|\rho)$. However, such “warp pooling” dilutes the
available information about $X$, especially given that it is combined with spatial pooling
as above.



3.3 Foreground Distribution 

The predictable behaviour of filter responses over natural scenes, which applies well
to background modelling, could not necessarily be expected to apply for foreground
models. Filter response $z$ over background texture assumes a characteristic kurtotic
form, well modelled by an exponential (Laplace) distribution. The foreground,




Fig. 4. Foreground and background distributions for support radius r = 20 pixels. The
background distribution has higher kurtosis, having extended tails.

being associated with just a single object, is less variable and does not have extended
tails (figure 4). Hence the exponential distribution that applies well to the background
is inapplicable and a normal distribution is more appropriate.

As for independence, filter outputs over the background are known to be uncorrelated
at a displacement of $r$, or $3\sigma$, but this need not necessarily hold over the foreground.
Nonetheless, autocorrelation experiments done over the foreground have produced evidence
of good independence for $\nabla^2 G$ filters, as in figure 6(b).
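
The contrast between a heavy-tailed background response and a lighter-tailed foreground response can be checked numerically through the excess kurtosis, which is 3 for a Laplace distribution and 0 for a Gaussian. The sketch below uses synthetic samples in place of real filter outputs; the function name and sample sizes are illustrative assumptions.

```python
import numpy as np

def excess_kurtosis(z):
    """Sample excess kurtosis: fourth central moment over squared variance, minus 3."""
    z = np.asarray(z, dtype=float)
    c = z - z.mean()
    return float(np.mean(c**4) / np.mean(c**2) ** 2 - 3.0)

rng = np.random.default_rng(0)
background_like = rng.laplace(0.0, 1.0, size=100_000)  # extended tails
foreground_like = rng.normal(0.0, 1.0, size=100_000)   # no extended tails

k_bg = excess_kurtosis(background_like)  # close to 3 for Laplace samples
k_fg = excess_kurtosis(foreground_like)  # close to 0 for Gaussian samples
```

Applied to measured responses, a markedly positive value would favour the exponential model and a value near zero the normal model.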



4 Conditioned Foreground Likelihood: Warping and Illumination 
Modelling 



It was demonstrated in section 1 that greater selectivity is needed in the foreground
model. Generally this can be approached by reducing the degree of pooling in the learning
of $p^F$. A previous attempt at this inhibited spatial pooling by subdivision, but this is not
altogether satisfactory, as explained in the previous section. The alternative investigated
here simultaneously diminishes both warp pooling and spatial pooling. It involves
warping a template image $\bar{I}$ onto the test image $I$ and taking the warped $T_X \bar{I}$ to be
the mean of the distribution for $I$. This warping scheme is described in the next section,
together with a further elaboration to take account of illumination variations.



4.1 Approximating Warps 

Two-dimensional warps $T_X$ could be realised with some precision as thin plate splines.
A more economical, though approximate, approach is proposed here. First the warped
outline contour is represented as a parametric spline curve, over a configuration-space
$\mathcal{X}$, defined to be a sub-space of the spline space. Then the warp of the interior of the object
is approximated as an affine transform by projecting the configuration $X$ onto a space
of planar-affine transformations [7, ch. 6]. The fact that this affine transformation warps
the interior only approximately is absorbed by pooling approximation error, during
learning, into the foreground distribution $p^F$. The resulting warp of the interior then
loses some specificity but is still “fair” in that the variability is fairly represented by
probabilistic pooling. (A similar approach was taken with pooled camera calibration
errors in mosaicing.)

To summarise, the warp model is bipartite: an accurate mapping of the outline contour
coupled with an approximate (affine) mapping of the interior. The precision of the
mapped contour ensures that foreground/background discrimination is accurate, and this is
essential for precise contour localisation. The approximate nature of the interior
mapping is however acceptable because it is used only for intensity compensation in which,
especially with large filter scale $\sigma$, there is some tolerance.

4.2 Single Template Case

Given a hypothesised warp $T_X$, the output $z(\mathbf{x})$ of a filter $W_{\mathbf{x}}$ centred at $\mathbf{x}$ is modelled
as

$$z(\mathbf{x}) = \langle W_{\mathbf{x}}, T_X \bar{I} + n \rangle = \langle W_{\mathbf{x}}, T_X \bar{I} \rangle + \langle W_{\mathbf{x}}, n \rangle = \bar{z}(\mathbf{x}, X) + Y_{\mathbf{x}} \tag{6}$$

where $\bar{z}(\mathbf{x}, X)$ is the predicted filter output and where $Y_{\mathbf{x}}$ is a random variable, whose
distribution is to be learned, assumed to be symmetric with zero mean. It is the residue
of the predicted intensity from the image data and is likely to have a narrow distribution
if prediction is reasonably effective, as in figure 5. Thus the distribution $p^Y$ is far more
restrictive than $p^F$. Using the $Y_{\mathbf{x}}$’s instead of the $z(\mathbf{x})$’s in the calculation of the global
likelihood $p(Z|X)$ results in more powerful and specific detection.




(a) Template (b) Image Data (c) Differenced Image

Fig. 5. Template subtraction. (a) The white contour marks the outline of the intensity template $\bar{I}$.
When it is subtracted from an image $I$ (b), the residue (c) is relatively small, as indicated by the dark
area over the face.



Note that the predicted output $\bar{z}(\mathbf{x}, X)$ can be approximated as

$$\bar{z}(\mathbf{x}, X) \approx (T_X \cdot W_{\mathbf{x}} * \bar{I})(\mathbf{x})$$

which is computationally advantageous as the filtered template $W_{\mathbf{x}} * \bar{I}$ can be computed
in advance. The approximation is valid provided $T_X$ is not too far from being a Euclidean
isometry. (An affine transformation, which is of course non-Euclidean, will change a
circular filter support $S$, and this generates some error.)



4.3 Light Source Modelling 

A family of templates $\bar{I}_1, \ldots, \bar{I}_K$ is generated corresponding to $K$ lighting conditions,
and typically $K = 4$ to span a linear space of shadow-free, Lambertian surfaces under
variable lighting. So the image data can be modelled as $I = T_X(\alpha \cdot \bar{I}) + n$. Now
the predicted filter outputs are defined to be

$$\bar{z}(\mathbf{x}, X, \alpha) = \langle W_{\mathbf{x}}, T_X(\alpha \cdot \bar{I}) \rangle = \sum_k \alpha_k \langle W_{\mathbf{x}}, T_X \bar{I}_k \rangle = \sum_k \alpha_k \bar{z}_k(\mathbf{x}, X) \tag{7}$$

Illumination modelling in this way makes for better prediction, allowing the distribution
of the residual $p^Y$ to become even narrower (see figure 8).
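
Because (7) is linear in $\alpha$, the predicted outputs and the residuals $Y_{\mathbf{x}}$ reduce to a matrix product once the per-template outputs $\bar{z}_k(\mathbf{x}_n, X)$ have been computed for a hypothesis $X$. A minimal sketch follows, in which the array layout and function names are assumptions made for illustration.

```python
import numpy as np

def predicted_outputs(template_outputs, alpha):
    """Equation (7): row n, column k of template_outputs holds z_bar_k(x_n, X)."""
    return template_outputs @ alpha

def residuals(z, template_outputs, alpha):
    """Residual Y_x = z(x) - z_bar(x, X, alpha) at every foreground filter position."""
    return z - predicted_outputs(template_outputs, alpha)

# Three filter positions, two illumination templates (toy numbers).
A = np.array([[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]])
alpha = np.array([0.5, 0.25])
y = residuals(np.array([0.5, 0.5, 1.0]), A, alpha)
```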

4.4 Joint and Marginal Distributions for Illumination-Compensated Foreground

In order to preserve the validity of (5), the independence of the $Y_{\mathbf{x}}$ for sufficiently
separated $\mathbf{x}$ should be checked. For instance, the correlation

$$E[Y_{\mathbf{x}} Y_{\mathbf{x}'}]$$

should tend to 0 sufficiently fast as $|\mathbf{x} - \mathbf{x}'|$ increases. As figure 6 shows, the correlation




Fig. 6. Foreground correlation. The correlation between filter outputs at various displacements
is shown (black) for the residual between the image data and the template, and this is very
similar to the correlation of the $z(\mathbf{x})$ (grey) and of the $Y_{\mathbf{x}}$ obtained by taking illumination factors into
account (light grey). Right: the foreground correlation (grey) is similar to background correlation
(black).



has fallen close to zero at a displacement of $r$, giving independence of adjacent outputs
for the support-tessellation of figure 2. Correlation functions for foreground and
background are broadly similar and so fit the same grid of filters. Finally, decorrelation is
a necessary condition for statistical independence but is not sufficient. Independence
properties can be effectively visualised via the conditional histogram. Figure 7
displays histograms which estimate $p(Y_{\mathbf{x}}, Y_{\mathbf{x}'} \mid |\mathbf{x} - \mathbf{x}'| = \delta)$ where $\delta = \sigma, 2\sigma, 3\sigma$ and $\mathbf{x}$
and $\mathbf{x}'$ are diagonally displaced ($r = 3\sigma$). The grey level in each histogram represents
the frequency in each bin: white indicates high frequency and black none. From these
it is clear that at the grid separation $r = 3\sigma$, $Y_{\mathbf{x}}$ and $Y_{\mathbf{x}'}$ are largely independent. It might

Template subtraction + illumination compensation

Fig. 7. Joint conditional histograms of pairs of filter responses. As $\delta$ increases the structure of the
histograms decreases. When $\delta = \sigma$ the white diagonal ridge indicates the correlation between the
filter responses, while at $\delta = 3\sigma$ this ridge has straightened and diffused. The two rows of figures
are extremely similar and show that the template subtraction and illumination compensation have
at most a marginal effect as regards whitening the data.





have been expected that template subtraction, especially with illumination compensation,
would have significantly decreased correlation of the foreground but that was not
the case. Where there is a significant effect is in the marginal distribution for $Y_{\mathbf{x}}$, which
becomes significantly narrower, as figure 8 shows.




Fig. 8. Illumination compensation narrows $p^Y(Y_{\mathbf{x}})$. Each graph displays $p^Y$ or
$p^F(z)$ learnt from data at different stages of preprocessing. Grey: raw filter responses ($p^F(z)$).
Black: template-subtracted residual responses. Light grey: template-subtracted plus
illumination-compensated residual responses.



This is a measure of the increased selectivity of modelling the foreground with
template subtraction, especially when this is combined with illumination compensation.

5 Learning and Inference



The goal is to infer the value of $X$ from $p(X|Z)$ via Bayes’ Rule and the constructed
likelihood function $p(Z|X)$. If the test data were labelled with the value $\alpha$ of
the illumination/object, inference would be straightforward. Modelling the illumination
means that we have instead $p(Z|X, \alpha)$. In principle the correct way to proceed
would be to integrate $\alpha$ out of $p(Z|X, \alpha)$ to construct

$$L(X) = p(Z|X) = \int_{\alpha} p(Z|X, \alpha)\, p_0(\alpha|X)\, d\alpha \tag{8}$$

However, due to the probable dimensionality of $\alpha$ and the computational expense of
exhaustively calculating $p(Z|X, \alpha)$ it is not feasible to compute this integration
numerically. In fact maximisation of $p(Z|X, \alpha)$ over $\alpha$ in place of integration is a well known
alternative that is simply an instance of the model selection problem. A factor $G(Z, X)$,
known as the “generacity” factor (and elsewhere known as the “Occam” factor),
is a measure of robustness of the inferred $\hat{\alpha}$, that is, the stability of $Z$ with respect
to fluctuations in $\alpha$:

$$G(Z, X) = \frac{p(Z|X)}{p(Z|X, \hat{\alpha}(X, Z))} \tag{9}$$

The generacity $G$ is then the additional weight that would need to be applied to the
maximised likelihood

$$\hat{L}(X) = L(X, \hat{\alpha})$$

to infer the posterior distribution for $X$:

$$p(X|Z) \propto \hat{L}(X)\, G(Z, X)\, p_0(X). \tag{10}$$

If $G(Z, X)$ does not vary greatly then it is reasonable to use $\hat{L}(X)$ instead of $p(Z|X)$.
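
One way to see when $G$ is roughly constant is the standard Laplace (saddle-point) approximation to the integral (8). This is a sketch under assumptions not made in the text: that $p(Z|X, \alpha)$ has a single sharp peak at $\hat{\alpha}$, that the prior $p_0(\alpha|X)$ is smooth there, and that $H$ (a symbol introduced here) denotes the negative Hessian of $\log p(Z|X, \alpha)$ with respect to $\alpha$ at $\hat{\alpha}$, with $K$ the dimension of $\alpha$.

$$p(Z|X) \approx p(Z|X, \hat{\alpha})\, p_0(\hat{\alpha}|X)\, (2\pi)^{K/2}\, |H|^{-1/2} \quad\Longrightarrow\quad G(Z, X) \approx p_0(\hat{\alpha}|X)\, (2\pi)^{K/2}\, |H|^{-1/2}$$

Under this approximation, and with a uniform prior on $\alpha$, $G$ varies with $X$ only through the curvature $|H|$ of the likelihood peak, so it is approximately constant whenever that curvature changes little across hypotheses.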



5.1 MLE for Illumination Parameters 

As stated it has been assumed that the residual variable $Y_{\mathbf{x}}$ in (6) is drawn from the
stationary distribution $p^Y$. The likelihood function for particular values of $X$ and $\alpha$ is
the product of three separate components, the likelihoods of the hypothesised background,
foreground and mixed measurements:

$$L(X, \alpha) = p(Z|X, \alpha) = \prod_{i \in \{F, B, M\}} \prod_{z \in \Upsilon_i} p(z|X, \alpha) = L_F(X, \alpha)\, L_B(X)\, L_M(X) \tag{11}$$

where $\Upsilon_F$, $\Upsilon_B$ and $\Upsilon_M$ are the sets containing the foreground, background and mixed
measurements. In the implementation of template subtraction and illumination compensation
only the foreground measurements are affected. Therefore only $L_F$ is dependent upon
$\alpha$.

Intuitively it would seem reasonable to solve for $\alpha$ by maximising $L_F$ in (11) with
respect to $\alpha$:

$$\hat{\alpha}(Z, X) = \arg\max_{\alpha} L_F(X, \alpha) \tag{12}$$

and then proceed with $\alpha$ fixed as $\hat{\alpha}$, so that $p(Z|X, \hat{\alpha})$ is used. A functional form of $p^Y$
is needed though in order to be able to differentiate equation (12). From figure 8 it is
plausible to assume that $p^Y$ is a zero mean Gaussian with variance $\gamma^2$. It then follows
that

$$L_F(X, \alpha) \sim \mathrm{MVN}(0, \gamma^2 I_{K \times K}) \tag{13}$$

where $I_{K \times K}$ is the identity matrix and MVN stands for the multi-variate normal
distribution. Maximisation of equation (13) is then equivalent to the least squares
minimisation

$$\hat{\alpha}(Z, X) = \arg\min_{\alpha} \sum_{n} \left( z(\mathbf{x}_n) - \bar{z}(\mathbf{x}_n, X, \alpha) \right)^2 \tag{14}$$

Thus in the factored sampling algorithm for inferring $X$ the following is implemented.
For each hypothesis $X_h$ a corresponding MLE $\hat{\alpha}_h$ is calculated and the likelihood
$p(Z|X_h)$ is approximated by $p(Z|X_h, \hat{\alpha}_h)$.
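
Since the prediction (7) is linear in $\alpha$, the minimisation (14) is an ordinary linear least-squares problem. The sketch below solves it with NumPy for a single hypothesis; the matrix layout (row $n$, column $k$ holding $\bar{z}_k(\mathbf{x}_n, X)$) and the function names are assumptions made for illustration rather than details taken from the paper.

```python
import numpy as np

def mle_alpha(z, template_outputs):
    """Least-squares estimate of alpha, as in equation (14)."""
    alpha_hat, *_ = np.linalg.lstsq(template_outputs, z, rcond=None)
    return alpha_hat

def foreground_log_likelihood(z, template_outputs, alpha, gamma):
    """log L_F under i.i.d. zero-mean Gaussian residuals with standard deviation gamma."""
    y = z - template_outputs @ alpha
    return float(-0.5 * np.sum(y**2) / gamma**2
                 - y.size * np.log(gamma * np.sqrt(2 * np.pi)))

# Synthetic check: noise-free outputs generated from a known alpha.
rng = np.random.default_rng(1)
A = rng.normal(size=(50, 3))
z = A @ np.array([0.2, -1.0, 0.7])
alpha_hat = mle_alpha(z, A)
```

In a factored sampling loop this solve would be repeated once per hypothesis $X_h$, since the matrix of template outputs changes with the warp.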




No illumination modelling

Illumination modelling

Fig. 9. Illumination modelling improves detection results. Layered sampling at two levels (r = 40
and 20 pixels) with the conditioned foreground likelihood model, first without and then with
illumination modelling. In the latter case $\alpha$ is inferred by its MLE value. The gross change in the
illumination conditions foils the naive conditioned foreground likelihood.

For the experiments performed in figure 9 the MLE method is used to infer $\alpha$.
However, it remains to be confirmed experimentally that $G(Z, X)$ remains more or less
constant. In this experiment a shadow basis was formed by taking three images with
the point light source to the left of, to the right of, and behind the subject. Then a sequence of
267 frames in which the light source moved around the subject was used as test data. In
every fifth frame the person’s face was searched for using layered sampling with the
conditioned foreground likelihood, independent of the results from the previous search. Two
levels of layered sampling were applied (r = 40, 20), with 900 samples at each level. The
prior for the object’s affine configuration space was uniform over x, y-translation and
Gaussian over the other parameters, allowing the contour to scale by ±20% horizontally,
vertically or diagonally and rotate 20 degrees from its original position. (Each of the 6
parameters was treated independently.) Using the proposed method the face was
successfully located at each frame. However, when illumination was not modelled detection
was not always successful. Two frames in which this happened are shown in figure 9.
To see the results of the whole sequence please see
http://www.robots.ox.ac.uk/~sullivan/Movies/FaceIlluminated.mpg.

5.2 Sampling Illumination Parameters 

In the previous subsection a method for inferring $\alpha$ was described. This method though
is not Bayesian. The alternative is to extend the state vector to $X' = (X, \alpha)$ and to
sample this in order to obtain a particle estimate of $p(X'|Z)$. This, however, is likely to
be computationally burdensome because of the increased dimensionality and also due
to the broad prior from which $\alpha$ must be drawn. Usually no particular prior for $\alpha$ will
be known and accordingly a uniform one will generally be used.

One option is to use an importance sampling function $g_X(\alpha)$ that restricts $\alpha$
to its likely range. It is possible to incorporate this importance function into the factored
sampling process as follows. Draw a sample $X_h$ from $p_0(X)$. Given this fixed value
of $X$ draw a sample $\alpha_h$ from $g_{X_h}(\alpha)$. The corresponding weight associated with the
particle $X'_h = (X_h, \alpha_h)$ is $L(X'_h) / g_{X_h}(\alpha_h)$ (the denominator is the correction factor
applied to compensate for the bias shown towards certain $\alpha$ values).

The most important question has yet to be answered. From where can an appropriate
importance function $g(\alpha)$ be found? In fact we don’t have to look any further than the
partial likelihood function $L_F$. From equation (13) this can be approximated by a
multi-variate normal distribution with diagonal covariance matrix. This then implies that $\alpha$,
given a fixed value of $X$, is also distributed as a multi-variate normal whose covariance
matrix and mean can be easily calculated. Allowing this distribution to be $g(\alpha)$ results in
an importance function that can be sampled from exactly and greatly narrows the range
of possible $\alpha$ values.
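
Under the Gaussian residual model with a linear prediction, $L_F$ viewed as a function of $\alpha$ is proportional to a normal density with mean equal to the least-squares solution and covariance $\gamma^2 (A^\top A)^{-1}$, where $A$ is the matrix of per-template filter outputs for the hypothesis. The sketch below draws $\alpha$ from that density and computes the foreground part of the importance weight. The names, the matrix $A$, and the restriction of the weight to $L_F$ are assumptions of this illustration; a full weight would also include $L_B$, $L_M$ and any non-uniform prior on $\alpha$.

```python
import numpy as np

def importance_gaussian(z, A, gamma):
    """Mean and covariance of g_X(alpha), taken proportional to L_F(X, alpha)."""
    ata = A.T @ A
    mean = np.linalg.solve(ata, A.T @ z)
    cov = gamma**2 * np.linalg.inv(ata)
    return mean, (cov + cov.T) / 2  # symmetrise against rounding error

def gaussian_logpdf(a, mean, cov):
    """Log density of a multivariate normal at a."""
    d = a - mean
    _, logdet = np.linalg.slogdet(cov)
    return float(-0.5 * (d @ np.linalg.solve(cov, d) + logdet
                         + mean.size * np.log(2 * np.pi)))

def sample_alpha_with_weight(z, A, gamma, rng):
    """Draw alpha ~ g_X and return it with log(L_F / g_X)."""
    mean, cov = importance_gaussian(z, A, gamma)
    alpha = rng.multivariate_normal(mean, cov)
    y = z - A @ alpha
    log_lf = float(-0.5 * np.sum(y**2) / gamma**2
                   - y.size * np.log(gamma * np.sqrt(2 * np.pi)))
    return alpha, log_lf - gaussian_logpdf(alpha, mean, cov)
```

Because $g_X$ is here exactly proportional to $L_F$, the returned log-weight is the same for every draw of $\alpha$ at a fixed $X$, which is what makes this choice of importance function efficient.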



6 Results and Conclusions 

Results of localisation by factored sampling, using the new conditioned foreground
likelihood, are shown next, and compared with the “partitioned foreground” approach
of earlier work. Each figure displays an image plus the particle representation of the posterior
distribution for the configuration of the target object. For clarity, just the 15 most highly
weighted particles are displayed, the weight of each particle being represented, on a log
scale, by the width of the contour. The black contour represents the mean configuration
of the particle set. Three different sets of experiments were carried out. First, the decoy
test checked whether the new likelihood was prone to highlighting the same false positives as the
partitioned likelihood. Second, the new method was tested on face detection, and finally
on the detection of other textured objects.



The decoy test. In figure 1 it was shown using a face decoy that the partitioned foreground
model was prone to ghost object hypotheses. Results of this experiment with the
new, conditioned foreground likelihood are shown in figure 10. Note the effect on the
mean configuration (the black contour): for the partitioned foreground, the mean lies
between the two peaks in the posterior. With the conditioned likelihood, however, the posterior
is unimodal, as evidenced by the coincidence of the mean configuration with
the main particle cluster. Experiments were carried out at one scale level, r = 40, and
using 1200 particles, uniformly distributed, the translational component of the prior
being drawn deterministically (i.e. on a regular grid), for efficiency. For computational



Partitioned foreground Conditioned likelihood 




Fig. 10. Conditioned foreground likelihood eliminates ghosting. Foreground partitioning
produces a bimodal posterior distribution (plainly visible from the position of the mean contour)
while conditioned foreground gives a unimodal distribution.



efficiency, multi-scale processing can be applied via “layered sampling”, and this
is demonstrated with person-specific models, for two different people, in figure 11. The
prior for the affine configuration space is uniform over x, y-translation and Gaussian
over the other parameters, allowing the contour to scale by ±20% horizontally, vertically
or diagonally and rotate 20 degrees from its original position. (Each of the 6
configuration space parameters is treated independently.) This prior is used for the rest of the
experiments unless otherwise stated.

Fig. 11. Layered sampling demonstrated for individuals, using individual-specific models.
The prior for the position of the face is uniform in x, y-translation over the image. The
search takes place over two scales (r = 20, 10) implemented via layered sampling, using 1500
samples in each layer.



Generalisation. Experimentally, a model trained on one individual turns out to be
capable of distinguishing the faces of a range of individuals from general scene background.
The experiment used the learnt model from figure 11(a) and applied it to the images
displayed in figure 12. Once again two levels of layered sampling were applied (r = 20, 10),
now with the number of samples increased to 3000. This performance is achieved




Fig. 12. Generalisation of face detection. Training on a single face generates a model that is still 
specific enough to discriminate each of a variety of faces against general scene background. 



without resorting to a more complex, multi-object training procedure, though it
remains to test what improvements multi-object training would bring.



Detecting various textured objects. Finally, the conditioned foreground likelihood
model has been tested on a variety of other objects, as in figure 13. Note that even
in the case of the textured vase resting against a textured sofa, the vase object is
successfully localised. Given that the boundary edge of the vase is not distinct, edge
based methods would not be expected to work well here. (Layered sampling was applied
at scales r = 20, 10 pixels with 1200 particles in each layer.) The prior in the clown
example allows for a greater rotation, while in the shoe example the prior has been
narrowed.




Fig. 13. Textured inanimate objects can also be localised by the algorithm. Special note should 
be taken of the detection of the vase against the textured sofa. 



7 Discussion and Future Work 

7.1 Modelling Object Variability 

In addition to lighting variations, a further generalisation is to allow object variations,
for example, in the case of faces, varying physiognomy and/or expression. This could be
dealt with in conventional fashion by training from a set $I^*_1, I^*_2, \ldots$ covering both
object and illumination variations, and using Principal Components Analysis (PCA)
to generate templates $\bar{I}_1, \ldots, \bar{I}_K$ that approximately span the training set. Then the
methodology of the previous section can be followed as before.

Alternatively, it may be the case that the training set is explicitly labelled with
illumination conditions $k = 1, \ldots, K$ and basis-object index $j = 1, \ldots, M$, in which
case the training set is organised as $\{I^*_{jk}\}$ and these could be used directly as templates
$\{\bar{I}_{jk}\}$. Then a general image is

$$I = \sum_{j,k} \alpha_{jk} \bar{I}_{jk} = \sum_{j,k} \beta_j \gamma_k \bar{I}_{jk}$$

where $\gamma_k$ weights light-sources and $\beta_j$ weights basis objects. Thus the $KM$ weights
$\alpha_{jk}$ applied to the templates decompose as $\beta_j \gamma_k$, and so have just $K + M$ degrees
of freedom. This is a familiar type of bilinear organisation, the “style and content”
decomposition, that occurs also with the decomposition of facial expression and
pose. Imposing the bilinear constraint that $\alpha = \beta \gamma^\top$, which stabilises the estimation
of $\alpha$, can be performed as usual by SVD.
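
The SVD step can be made concrete as follows. Given an unconstrained estimate of the weight matrix with entries $\alpha_{jk}$, the closest matrix of the form $\beta \gamma^\top$ in the Frobenius norm is obtained from the leading singular triplet. This is a minimal sketch; the function name and the convention of absorbing the singular value into $\beta$ are assumptions, since the factorisation is only defined up to a scale exchanged between $\beta$ and $\gamma$.

```python
import numpy as np

def bilinear_project(alpha_matrix):
    """Best rank-1 factorisation alpha ~ beta gamma^T via the leading singular triplet."""
    u, s, vt = np.linalg.svd(alpha_matrix, full_matrices=False)
    beta = u[:, 0] * s[0]   # object weights (one per row j), scale absorbed here
    gamma = vt[0]           # light-source weights (one per column k)
    return beta, gamma

# Example: a weight matrix that is exactly bilinear is recovered exactly.
alpha = np.outer([1.0, 2.0, 3.0], [0.5, -1.0])
beta, gamma = bilinear_project(alpha)
alpha_rank1 = np.outer(beta, gamma)
```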

In this bilinear situation, the earlier model (7) is extended to take account of light
source and object variations as follows.

$$\bar{z}(\mathbf{x}, X, A) = \sum_{j,k} \alpha_{jk}\, \bar{z}_{j,k}(\mathbf{x}, X) \tag{15}$$

where

$$\bar{z}_{j,k}(\mathbf{x}, X) = \langle W_{\mathbf{x}}, T_X \bar{I}_{j,k} \rangle$$

and $A$ is a matrix whose entries are $\alpha_{jk}$.

Abstract. This paper describes the development of a system for the
segmentation of small vessels and objects present in a maritime environment.
The system assumes no a priori knowledge of the sea, but uses
statistical analysis within variable size image windows to determine a
characteristic vector that represents the current sea state. A space of
characteristic vectors is searched and a main group of characteristic vectors
and its centroid found automatically by using a new method of
iterative reclustering. This method is an extension and improvement of
the work described in [2]. A Mahalanobis distance measure from the centroid
is calculated for each characteristic vector and is used to determine
inhomogeneities in the sea caused by the presence of a rigid object. The
system has been tested using several input image sequences of static
small objects such as buoys and small and large maritime vessels moving
into and out of a harbour scene and the system successfully segmented
these objects.



1 Introduction 

Maritime vessels are today faced with the threat of piracy. Piracy is usually
associated with old swashbuckling films, and consequently we tend not to consider
piracy in the modern age; however, several incidents of piracy happen each day,
particularly in the Malacca Straits and the South China Sea areas. Here fast
RIB craft (Rigid Inflatable Boats) approach the stern of a large cargo ship,
even super-tankers, and scale the ship using simple rope ladders. The small
numbers of crew that these ships have on duty means pirate detection needs
to be automated. Current radar systems are of limited use in these situations,
as RIB craft are small and almost non-metallic, and consequently have poor radar
returns, so that radar systems find them difficult to detect. To overcome this
problem an image processing system is under development.

The maritime scene, however, has been found to be extremely complex to
analyse, producing a large number of motion cues that make identification and
tracking in the visual environment complex. The system being developed here
concentrates on the task of extracting the maritime vessels and other static nautical
objects (buoys, mooring buoys, piers, etc.) from the sea to aid the recognition
and tracking process. To accomplish this task three integrated algorithms have
been developed, namely (i) variable size image window analysis, (ii) statistical
analysis by reclustering and (iii) region segmentor. The variable window analysis
determines a set of overlapping image windows and, for each image window,
calculates the energy, entropy, homogeneity and contrast. This vector effectively
forms a four-dimensional feature for each image window. The statistical analyser
uses a new method of iterative reclustering of the feature space to determine the
centroid of vectors representing the main feature in the scene (sea). The region
segmentor calculates the Mahalanobis distance between the feature centroid and
the feature vector of each image window, which identifies outliers from the mean.
These outliers are potentially regions that contain inhomogeneities, effectively
forming a feature map [3], which may indicate the presence of a rigid object.
These extracted regions effectively form regions of interest (ROI) in the image,
and the region segmentor identifies these ROIs in the original image sequence
using white rectangular boxes.
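
The outlier test performed by the region segmentor can be illustrated with a short sketch. It computes the Mahalanobis distance of each window's four-dimensional feature vector from a centroid and flags windows beyond a threshold. This is only an illustration under assumptions: the centroid and covariance are taken here as the plain sample mean and covariance rather than the result of the iterative reclustering described in the paper, and the function names and threshold are hypothetical.

```python
import numpy as np

def mahalanobis_distances(features, centroid, cov):
    """Mahalanobis distance of each row of features from the centroid."""
    d = features - centroid
    inv = np.linalg.inv(cov)
    return np.sqrt(np.einsum("ij,jk,ik->i", d, inv, d))

def flag_outliers(features, threshold):
    """Boolean mask of windows whose feature vector lies far from the main group."""
    centroid = features.mean(axis=0)        # stand-in for the reclustered centroid
    cov = np.cov(features, rowvar=False)    # feature covariance over all windows
    return mahalanobis_distances(features, centroid, cov) > threshold

# 200 "sea" windows plus one window with very different statistics.
rng = np.random.default_rng(3)
features = np.vstack([rng.normal(size=(200, 4)), np.full((1, 4), 20.0)])
mask = flag_outliers(features, threshold=6.0)
```

Flagged windows would then be marked as regions of interest in the image.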




Figure 1. Typical nautical scene. 



2 Window Analysis 

The segmentation of a maritime scene is complicated by the fact that waves cause
noise (undesirable regions of interest) that does not have a Gaussian distribution,
and consequently traditional ways of filtering are ineffective. The main properties
of this noise, i.e. its appearance, are spatially dependent. Fig. 1 shows a typical
maritime scene and we can see that the noise is not distributed uniformly in the
image: a noise ’pattern’ is formed which can clearly be seen in the bottom of the
image.

Commonly used texture techniques (as described in many texts)
proved to be unsuccessful in describing the distribution of these noise patterns
as they differ from scene to scene and frame to frame. However, from looking at
a typical nautical scene, we observe a plane (sea level) that is almost parallel to
the camera axis.

The bottom of the image contains that part of the sea that is closest to the 
camera, while the horizon is made up of points at infinity on the sea level plane 
0. Therefore, the resolution of observation is larger for any objects that are 
close to the bottom of the image than for objects that are closer to the horizon. This
also holds true for the noise patterns. Using this observation we can see that a
variable size image window segmentation technique will require finer (smaller)
image windows as we approach the horizon, but coarser (larger) image windows
could be used closer to the bottom of the image. The variable image analysis 
algorithm is passed the position of the image horizon and an initial window size. 

Overlapping image windows are determined by growing the window size from 
an initial 16 by 16 pixels on the horizon line towards the bottom line of the 
image at a rate of 6% per window line. For our experiments we used rates 
from 5% to 10% depending on the camera angle under which the scenes are 
observed. The image windows are allowed to overlap by 33%. This effectively 
positions a grid on the sea plane as shown in Fig. 2. If we consider perspectivity 
then the correct shape of the projected grid tiles should be trapezoidal. This 
brings a complication to the process because we would have to use bilinear or 
other perspective transformation for each of the windows to transform it into 
a rectangle. These transformations are computationally intensive. However, it 
has been found that rectangles provide a good approximation of trapezoidal 
segments. The size of the windows and the amount of overlap are stretched
accordingly so that the windows cover the whole region under observation and there
are no uncovered ‘blind spots’ at the sides or at the bottom of the image.
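
The window layout described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function name `variable_windows` and its parameters are hypothetical, the defaults are the values quoted in the text (16 pixel initial size, 6% growth per window line, 33% overlap), and instead of stretching sizes and overlaps the sketch simply shifts the last window of each row and the last row inwards so that no blind spots remain.

```python
def variable_windows(width, height, horizon_y,
                     init_size=16, growth=0.06, overlap=0.33):
    """Return square windows (x, y, w, h) covering the image below the horizon.

    Assumes 0 <= horizon_y < height. Window size grows by `growth` per
    window line, and neighbouring windows overlap by roughly `overlap`.
    """
    windows = []
    size = float(init_size)
    y = horizon_y
    while True:
        # Clamp the window so that it always fits inside the sea region.
        s = min(int(round(size)), width, height - horizon_y)
        step = max(1, int(round(s * (1.0 - overlap))))
        top = min(y, height - s)          # shift the last row inwards
        x = 0
        while True:
            left = min(x, width - s)      # shift the last window inwards
            windows.append((left, top, s, s))
            if left + s >= width:
                break
            x += step
        if top + s >= height:
            break
        y += step
        size *= 1.0 + growth              # coarser windows nearer the camera
    return windows
```

For a 64 x 48 image with the horizon at row 10, this yields three window lines of size 16, 17 and 18 pixels that together cover every pixel below the horizon.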

Each window is then resized to the size of the smallest windows (a window 
near the horizon) by using either simple re-sampling or bilinear interpolation. 
Bilinear interpolation gives better results but is slower, while simple re-sampling 
gives poorer results but is much faster and for most applications is sufficient. The 
final task of the variable window analysis is to calculate the following statistical 
values jnj for each image window: 



$$\text{energy} = \sum_{r=0}^{R} \sum_{c=0}^{C} \ldots \tag{1}$$

$$\text{entropy} = \sum_{r=0}^{R} \sum_{c=0}^{C} \log(P(r,c)) \cdot P(r,c) \tag{2}$$

$$\text{homogeneity} = \ldots \tag{3}$$

$$\text{contrast} = \sum_{r=0}^{R} \sum_{c=0}^{C} (r - c)^{2} \cdot P(r,c) \tag{4}$$

where r, c are row and column indexes, P(r,c) is the pixel value at position r,c 
and R, C are the image window boundaries. The calculated values are arranged 
to form a 4-element vector, giving N 4-element feature vectors, where N is the 
total number of windows in the segmentation. 
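
As a rough sketch of how one 4-element feature vector could be computed, the function below follows eqs. (2) and (4) as printed. The energy and homogeneity terms are assumptions: they use the common textbook forms (sum of squared values, and values weighted by 1/(1 + |r - c|)), which may differ from the authors' eqs. (1) and (3). The name `window_features` and the convention of skipping zero-valued pixels in the logarithm are also hypothetical choices.

```python
import math


def window_features(window):
    """Return (energy, entropy, homogeneity, contrast) for one image window.

    `window` is a list of rows of non-negative pixel values P(r, c).
    """
    energy = entropy = homogeneity = contrast = 0.0
    for r, row in enumerate(window):
        for c, p in enumerate(row):
            energy += p * p                        # assumed form of eq. (1)
            if p > 0:                              # assumed convention: 0 * log(0) = 0
                entropy += math.log(p) * p         # eq. (2) as printed
            homogeneity += p / (1.0 + abs(r - c))  # assumed form of eq. (3)
            contrast += (r - c) ** 2 * p           # eq. (4)
    return (energy, entropy, homogeneity, contrast)
```

Applying this to each of the N resized windows gives the N feature vectors that the statistical analyser operates on.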




Figure 2. Variable Image Windows overlaid on the sea, minimum window size of 16 x 
16 pixels with 33% overlap and an expansion rate of 6%. 



3 Statistical Analyser 

We can consider the vectors calculated from the variable window analysis as 
a population of points in a 4-dimensional feature space. The statistical anal- 
yser determines a set of characteristic features that could be used to describe 
the current sea state. This set is represented by a main cluster in the feature 
space. The previous algorithm used to find the main cluster is described in j0|. 
This algorithm uses histograms that are constructed for each of the four pre- 
viously described characteristics. It divides the smoothed data histograms into 
subparts by local minima and assigns the largest subpart to the main cluster. 
This method does not perform well for smaller numbers of feature vectors. In 
these cases it becomes difficult to find the correct local minima because of the 
lack of data needed to create meaningful histograms. Another disadvantage is 
the presence of many thresholds whose settings influence the results significantly.

A new method is introduced that helps to approximate the distribution of the
unlabelled feature vector data in feature space. The method takes feature vectors 
generated at the previous stage of the algorithm (variable size image windows 
analysis) as an input and it iteratively determines the centroid and the covariance 
matrix for the data in the main cluster. The problem here is that there is no 
useful knowledge about the data due to the nature of the problem (each scene 
segmented in the previous stage of the algorithm can be significantly different 
from the previous one in terms of sea appearance and presence of objects). The 
only usable knowledge is that there is a certain main cluster in the feature 
space which comprises the vectors corresponding to major features in the scene 
(presumably the sea). These vectors are relatively close to one another. Other
vectors (outliers) represent regions where objects are in the scene, and these
vectors are relatively far from the main cluster and its centroid. Unfortunately,
due to the nature of the problem, we cannot use learning and classification
algorithms (as described in Schalkoff 0) as the feature data can change its values
without obeying any fixed rule. The distributions of feature data change from scene
to scene and the only usable information is the presence of the main cluster and
possible outliers.

We assume that the main cluster contains the majority of vectors and that 
these vectors are relatively close to one another. Other vectors or groups of 
vectors (representing the objects) are positioned relatively far from this main 
cluster. Therefore, if we calculate the centroid of all the vectors in the distribution 
by using the mean, or better, median then we can assume that this centroid of 
all vectors is not far from the centroid of only the vectors in the main cluster. This
is because there are many vectors close together whose position will bias (or
attract) the position of the centroid determined as the median of all vectors in
the feature space. Experiments have shown that the median performs better than
the mean because the median is not influenced by a small number of outlying vectors.
The next step after determining the centroid of the whole distribution in feature
space is to choose which vectors actually fall into the main cluster. We assume
that the main cluster lies within the boundary that corresponds to the mean
distance of all the vectors from the determined centroid. Thus, the resulting
group of vectors has a centroid corresponding to the median of all vectors in the
distribution and includes the vectors whose distance from it is less than the mean
distance of all the vectors in the feature space.

The next step is similar to the one described above: once again, we determine 
the median centroid but now we use only the vectors lying within the mean 
distance from the previous centroid. We recalculate the mean distance from the 
newly calculated centroid for all the vectors in the group. The new main cluster 
consists of the vectors that lie within the new mean distance from the new 
centroid. 
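
The reclustering steps described above, repeated for a fixed number of iterations, can be sketched as follows. This is a simplified illustration under stated assumptions: the median is taken component-wise, a plain Euclidean distance stands in for the distance measure so that the example stays short, and the names `recluster` and `euclidean` are hypothetical. Vectors at exactly the mean distance are kept so that a group of identical vectors is never emptied.

```python
import math
from statistics import median


def euclidean(vector, centre):
    """Stand-in distance; any other distance function can be passed instead."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(vector, centre)))


def recluster(vectors, iterations=2, distance=euclidean):
    """Iteratively shrink the group of vectors towards the main cluster.

    Returns (centroid, cluster): the median centroid used in the final
    iteration and the vectors lying within the mean distance from it.
    """
    cluster = list(vectors)
    centroid = None
    for _ in range(iterations):
        # Component-wise median of the current group (assumed median form).
        centroid = tuple(median(column) for column in zip(*cluster))
        dists = [distance(v, centroid) for v in cluster]
        mean_dist = sum(dists) / len(dists)
        # Keep only the vectors within the mean distance of the centroid.
        cluster = [v for v, d in zip(cluster, dists) if d <= mean_dist]
    return centroid, cluster
```

With eight tightly packed vectors and two distant ones, a single iteration already discards the two distant vectors; a second iteration tightens the group further.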

This process is repeated iteratively. Only a small number of iterations is needed,
as after each step the group of selected vectors shrinks significantly,
especially if the main cluster is packed tightly together. Practical experiments
showed that one to three iterations are sufficient.

We use the Mahalanobis measure to determine the distances among feature
vectors:

$$k = (\vec{x} - \vec{\mu}) \cdot C^{-1} \cdot (\vec{x} - \vec{\mu})^{T} \tag{5}$$

where $k$ is the distance, $\vec{x}$ is the feature vector, $\vec{\mu}$ is the centroid and
$C^{-1}$ is the inverse of the covariance matrix. The reason for using the Mahalanobis
distance is that the data is highly correlated. The Mahalanobis distance used 
in this method is slightly modified - the centroid used in the formula is not 
determined as a mean but as a median. The reason for that, as stated above, 
is to avoid the influence of outliers. This method determines the main cluster and
its centroid only approximately, but as we have no prior knowledge about
the data this is sufficient to determine the outliers that represent the regions with
objects in the scene. Experiments showed that outliers are separated from
the main cluster vectors by orders of magnitude (the Mahalanobis distance of
outliers from the centroid is a few orders of magnitude higher than that of vectors
in the main cluster), even for highly scattered feature vectors. Another important
property of the method is the fact that it does not shift the centroid of the feature 
vectors significantly if the data is relatively consistent and does not contain any 
outliers. 
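
A minimal sketch of the modified Mahalanobis measure of eq. (5), using a median centroid, is given below. It is an illustration under assumptions: the covariance matrix is estimated about the sample mean in the usual way, the median is component-wise, and the linear system is solved by Gaussian elimination instead of forming the inverse explicitly. All function names are hypothetical.

```python
from statistics import median


def median_centroid(vectors):
    """Component-wise median of the feature vectors (assumed median form)."""
    return tuple(median(column) for column in zip(*vectors))


def covariance(vectors):
    """Sample covariance matrix of the feature vectors (needs >= 2 vectors)."""
    n, dim = len(vectors), len(vectors[0])
    mean = [sum(column) / n for column in zip(*vectors)]
    cov = [[0.0] * dim for _ in range(dim)]
    for v in vectors:
        diff = [a - m for a, m in zip(v, mean)]
        for i in range(dim):
            for j in range(dim):
                cov[i][j] += diff[i] * diff[j] / (n - 1)
    return cov


def solve(matrix, rhs):
    """Solve matrix * x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    a = [list(row) + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-12:
            raise ValueError("covariance matrix is singular")
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for k in range(col, n + 1):
                a[r][k] -= factor * a[col][k]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(a[i][k] * x[k] for k in range(i + 1, n))
        x[i] = (a[i][n] - tail) / a[i][i]
    return x


def mahalanobis(vector, centroid, cov):
    """Quadratic form (x - mu) C^-1 (x - mu)^T as written in eq. (5)."""
    diff = [a - b for a, b in zip(vector, centroid)]
    return sum(d * y for d, y in zip(diff, solve(cov, diff)))
```

Passing `median_centroid(cluster)` and `covariance(cluster)` for the main cluster gives the distance of any window's feature vector from that cluster.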

The main advantage of the algorithm is that there is no need for prior knowl- 
edge to approximate the distribution of the vectors in the main cluster. Another 
important advantage is the absence of any thresholds. The only value that is to 
be set is the number of iterations and as stated above, one to three iterations 
are sufficient. Figures 3a-3f show two iterations of the reclustering process in 2D 
projections of the feature space. 

The statistical analyser applies the method described above to the feature
vectors determined by the variable size image window analysis and determines
the Mahalanobis distance from the main cluster centroid for each of the vectors.

4 Region Segmentor 

The statistical analyser has calculated the distances of the feature vectors from
the centroid of the main cluster, which represents the main feature in the image
(presumably the sea). The region segmentor must now determine those image
windows whose feature vectors have a Mahalanobis distance above a set
threshold. The value of the Mahalanobis distance for each vector provides a
measure of the likelihood of an image window containing an object: the greater the
distance, the greater the likelihood of it containing a vessel or other man-made
object. Figure 5 shows the result of transforming the values of the Mahalanobis
distance measure back into the image plane; the darker the image window, the
greater the likelihood of that tile being a region of interest. The Mahalanobis
distances are now scanned and the rate of change of the distance is calculated.
If the rate of change is below a threshold value, the Mahalanobis distance is
replaced with the minimum of that region. Finally, the Mahalanobis distances that
have minimum values correspond to the primary feature in the scene, namely
the sea (Fig. 4). The determination of the primary feature works even if the
object covers the majority of the scene. The main feature then represents the
object, and outliers, determined by a large distance value, represent either sea or
other smaller objects.
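
One possible reading of the homogenising and thresholding steps is sketched below. The text does not specify how a "region" is delimited, so this sketch assumes that consecutive windows in scan order belong to one region as long as the change in distance between neighbours stays below the rate threshold. The function names and both threshold parameters are hypothetical.

```python
def homogenise(distances, rate_threshold):
    """Replace each run of slowly varying distances by the run's minimum."""
    result = []
    region = []
    for d in distances:
        # A jump of at least rate_threshold closes the current region.
        if region and abs(d - region[-1]) >= rate_threshold:
            result.extend([min(region)] * len(region))
            region = []
        region.append(d)
    if region:
        result.extend([min(region)] * len(region))
    return result


def regions_of_interest(distances, rate_threshold, distance_threshold):
    """Indices of windows whose homogenised distance exceeds the threshold."""
    flat = homogenise(distances, rate_threshold)
    return [i for i, d in enumerate(flat) if d > distance_threshold]
```

For the distances `[1.0, 1.2, 0.9, 50.0, 52.0, 1.1, 1.0]` with a rate threshold of 5, the three runs are flattened to 0.9, 50.0 and 1.0, and a distance threshold of 10 marks windows 3 and 4 as regions of interest.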







Figure 4. Mahalanobis distance of feature vectors from the centroid of the main cluster 
after process of homogenizing (distance values are substituted by local minima). 



5 Discussion 

A static camcorder was set up at the entrance to Portsmouth harbour and an 
image sequence showing small motor vessels and in particular RIBs moving out 
of the harbour was filmed. From this sequence a 1500 frame clip was digitised to 
disk at a rate of 10 frames per second. A second sequence was filmed at Poole 
harbour showing yachts and buoys moving in the scene and a third showing a 
medium sized vessel approaching a pier. 

The error rate of the segmentation was determined as the ratio between the number
of frames where the segmentation was incorrect (i.e., rigid objects present in the
scene were not found or false regions without any objects were marked) and the total
number of frames in each sequence. This ratio is stated in percentage terms.

Figures 6a and 6b show the Portsmouth scene where a larger motor vessel
led a procession of five smaller motor vessels out of the harbour. The algorithm
correctly segmented the motor vessels 91% of the time; however, as the vessels
moved across the scene several segmented regions were merged. This particular
sequence included a number of RIBs.

Figures 7a and 7b show the Poole scene where small and large yachts were
moving into and out of the harbour entrance together with a small buoy. The
system segmented out the yachts 95% of the time; however, again as vessels
crossed, the segmented regions were merged. The system did, however, incorrectly
segment the buoy 15% of the time, but this is still an improvement on the results
shown in |^.

Figure 5. Mahalanobis distance transformed back to image plane. Blobs are positioned
at the centers of windows used in segmentation. Brightness of the blob corresponds to
the likelihood of an object being present in the window.

Figure 7a shows that the algorithm has found only the bottom of the large yacht.
The reason for this is that the algorithm segments only the sea and ignores
everything above the shore. Thus, this algorithm serves only as a partial solution
to the maritime scene segmentation task.

Figures 8a and 8b show the ability of the system to correctly segment both
static and moving objects in the scene even if these cover large areas of the
image.

6 Conclusion 

A method for segmenting static man-made objects and small vessels moving 
in a maritime scene has been developed and has been shown to provide reliable 
segmentation results for a number of maritime scenes. The algorithm uses simple 
mathematical operators to build a statistical character of the sea. A new method 
of feature space re-clustering has been introduced for statistical analysis, based
on the work first described in j0|.

One advantage of the algorithm is that it uses only the current image in the
segmentation process; it does not rely on any change between consecutive
images to provide the regions of interest. Nor does it rely on any prior
knowledge about the characteristics representing the sea. It efficiently eliminates
the noise caused by the motion of the sea, and has demonstrated within the
constraints of the project that this is both scene and time independent.

However, the algorithm as it stands requires initial start positions for the
horizon and the minimum window size, which must be passed to the algorithm
at the start of processing. It also requires the threshold value for separating
the main feature from outliers, which is the main drawback at the moment. It
does not give an exact and final identification of any objects in the scene; it
provides only a measure of the objects' presence.

Future enhancements to the algorithm are aimed at automating
the horizon identification, determining a function for homogenising the
Mahalanobis distance measure to preserve outliers, and using connectivity analysis
to produce improved object detection. Future development is also directed
at finding and processing the temporal correspondence of the detected regions in the
sequence.

Another important enhancement to the algorithm is aimed at substituting 
the final thresholding of the Mahalanobis distances with a clustering algorithm 
that connects the regions with similar Mahalanobis distances. A good description 
of such a clustering algorithm is given in |S|.


Figure 6. Procession of small motor vessels (a) frame 200, (b) frame flOO. 




Figure 7. Large and small yachts and a buoy, (a) frame 300, (b) frame 900.

Figure 8. Medium sized vessel approaching a pier, (a) frame 100, (b) frame 200.



Abstract. A new probabilistic background model based on a Hidden Markov
Model is presented. The hidden states of the model enable discrimination between 
foreground, background and shadow. This model functions as a low level process 
for a car tracker. A particle filter is employed as a stochastic filter for the car tracker. 
The use of a particle filter allows the incorporation of the information from the 
low level process via importance sampling. A novel observation density for the 
particle filter which models the statistical dependence of neighboring pixels based 
on a Markov random field is presented. The effectiveness of both the low level 
process and the observation likelihood are demonstrated. 



1 Introduction 



The main requirement of a vision system used in automatic surveillance is robustness
to different lighting conditions. Lighting situations which cast large shadows are
particularly troublesome (see figure 1) because discrimination between foreground and
background is then difficult. As simple background subtraction or inter-frame differencing
schemes are known to perform poorly, a number of researchers have addressed the
problem of finding a probabilistic background model 1611711011312011 . Haritaoglu et al.
[5] only learn the minimal and maximal grey-value intensity for every pixel location. The
special case of a video camera mounted on a pan-tilt head is investigated in El-. Here a
Gaussian mixture model is learnt. Paragios and Deriche lO demonstrate that a
background/foreground segmentation based on likelihood ratios can be elegantly incorporated
into a PDE Level Set approach. In order to acquire training data for these methods it
is necessary to observe a static background without any foreground objects. Toyama et
al. I2l|| address the problem of background maintenance by using a multi-layered
approach. The intensity distribution over time is modelled as an autoregressive process of
order 30. This seems to be an unnecessarily complex model for a background process.
None of the above models is able to discriminate between background, foreground,
and shadow regions. In the present paper we propose a probabilistic background model
based on a Hidden Markov Model (HMM). This model has two advantages. Firstly, it is
no longer necessary to select training data: the different hidden states allow the learning
of distributions for foreground and background areas from a mixed sequence. Secondly, by adding
a third state it is possible to extend the model so that it can discriminate shadow regions.
The background model is introduced in section 2.

In addition to the low level process it is necessary to build a high level process that
can track the vehicles. Probabilistic trackers based on particle filters [Ql are known to
be robust and can be extended to tracking multiple objects Gn. The benefit of using
a particle filter is that the tracker can recover from failures |7I|. More importantly,
the use of a particle filter also provides a way to utilise the information of the low level
process modelled by the HMM. The propagated distribution for the previous time-step
t - 1 is effectively used as a prior for time t. It is very difficult to fuse two sources of
prior information. However, importance sampling, as introduced in |0, can be used to
incorporate the information obtained from the low level process. Instead of applying the
original algorithm, an importance sampling scheme which is linear in time | |1 b| | is used
here. The importance function itself is generated by fitting a rectangle with parameters
Xj to the pixels which are classified as foreground pixels (see figure 0) and using a
normal distribution with fixed variance and mean Xj as the importance function. The
remaining challenge is to build an observation likelihood for the particle filter which
takes account of spatial dependencies of neighbouring pixels. The construction of this
observation likelihood is discussed in section 3. We demonstrate that by employing a
Markov random field it is possible to model these statistical dependencies.

Such a car tracking system has to be able to compete with existing traffic monitoring
systems. Beymer et al. [2] built a very robust car tracker. Their tracking approach is
based on feature points and works in most illumination conditions. The disadvantage
of the system is that it is necessary to run a complex grouping algorithm in order to
solve the data association problem. The use of additional algorithms would be necessary
to extract information about the shape of the cars. By modelling cars as rectangular
regions it would be possible to infer their size and allow classification into basic
categories. Koller et al. [ilOj as well as Ferrier et al. |E] already demonstrated applications
of contour tracking to traffic surveillance. (IffJ extracts a contour from features
computed from inter-frame difference images as well as the grey value intensity images
themselves. In the case of extreme lighting conditions as shown in figure 1 this system
is likely to get distracted. Approaches which model vehicles as three dimensional wire
frame objects II KI1 211 .31 are of course less sensitive to extreme lighting conditions. The
main drawback of modelling vehicles as three dimensional objects is that the tracking is
computationally expensive. The challenge is to design a robust real-time system which
allows the extraction of shape information.



2 A Probabilistic Background Model 

In addition to being able to discriminate between background and foreground it is also
necessary to detect shadows. Figure 2 clearly shows that the grey-value distribution of
the shadow differs significantly from the intensity distributions in the foreground and
background regions. This is the motivation for treating the shadow region separately.
Since all three distributions have a large overlap it is of course not possible to construct
a background model which is purely based on intensity values. However, another source
of information is available: the temporal continuity. Once a pixel is inferred to be in a
foreground region it is expected to be within a foreground region for some time. A
suitable model to impose such temporal continuity constraints is the Hidden Markov Model
(HMM). The grey-value intensities over time for one specific pixel location are to be
modelled as a single HMM, independent of the neighbouring pixels. This is of course an







J. Rittscher et al. 




Fig. 1. A traffic surveillance example. This is a typical camera image from a traffic surveillance 
camera. Notice that especially for dark coloured cars intensity differences between foreground 
and background are small. In order to track the cars robustly it is necessary to detect the shadows 
as well as the cars. 



unrealistic independence assumption. The spatial dependencies of neighbouring pixel
locations will be modelled by the higher level process (see section 3). The reader should
note that the specific traffic surveillance situation (see figure 1) is particularly suited to
investigating this class of model because the speed of the cars does not vary greatly. It
is therefore possible to learn parameters which will determine the expected duration a
pixel belongs to a foreground, shadow or background region.




Fig. 2. Intensity histograms of the different regions. Intensity values for single pixel positions 
were collected from a 30 seconds long video sequence and manually classified into the regions: 
foreground, shadow or background. The intensity histograms of the different regions clearly show 
a large amount of overlap. A method which is purely based on grey-value intensities is therefore 
inadequate for this problem. 



The model parameters of the HMM with $N$ states are the initial state distribution
$\pi = \{\pi_i\}$, the state transition probability distribution $A = \{a_{ij}\}$, and the emission
or observation probability for each state, $p_f(z)$, $p_b(z)$ and $p_s(z)$. The set of parameters
defining the HMM model will be abbreviated as $\omega := (A, \pi, p_f, p_s, p_b)$. Standard texts
include im. Based on the intensity histograms of figure 2 the emission models of the
background and shadow regions are modelled as Gaussian densities. Since very little
about the distribution of the colours of vehicles is known, the observation probability of
the foreground region is taken to be uniform. Hence



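$$p_f(z) = \frac{1}{\ldots}, \qquad p_s(z) = \frac{1}{\sqrt{2\pi\sigma_s^2}} \exp\!\left(-\frac{(z-\mu_s)^2}{2\sigma_s^2}\right) \qquad \text{and} \qquad p_b(z) = \frac{1}{\sqrt{2\pi\sigma_b^2}} \exp\!\left(-\frac{(z-\mu_b)^2}{2\sigma_b^2}\right). \tag{1}$$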



It is of course possible to employ more complex emission models. In section 2.2 it will
be shown that it is in fact necessary to use a more complex model for the observations.
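
The three emission models can be illustrated with a short sketch. The function names are hypothetical, and the uniform foreground density is assumed here to be spread over 256 grey levels (an 8-bit image), which the text does not state explicitly.

```python
import math


def gaussian_pdf(z, mean, sigma):
    """Gaussian density with the given mean and standard deviation."""
    return math.exp(-(z - mean) ** 2 / (2.0 * sigma ** 2)) / math.sqrt(2.0 * math.pi * sigma ** 2)


def emission_probabilities(z, mu_b, sigma_b, mu_s, sigma_s, levels=256):
    """Emission densities of intensity z for foreground, background and shadow."""
    return {
        "f": 1.0 / levels,                   # uniform; 256 levels is an assumption
        "b": gaussian_pdf(z, mu_b, sigma_b),
        "s": gaussian_pdf(z, mu_s, sigma_s),
    }
```

For an intensity close to the background mean the background density dominates, while intensities far from both Gaussian means are best explained by the uniform foreground model.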



2.1 Parameter Learning 

For a given training sequence the model parameters are estimated by using a maximum
likelihood approach. Because the model has hidden parameters an expectation
maximisation (EM) type approach is used. In this particular case the Baum Welch algorithm 0
is applied as a learning algorithm. Because EM-type algorithms are not guaranteed to
find the global maximum and are very sensitive to initialisation it is necessary to explain
how the initialisation is done. In order to find an initialisation method the following time




Fig. 3. Learnt emission models. Shown is a set of emission models for one pixel location. The
distributions $p_f$, $p_s$ and $p_b$ model the intensity distributions for all three states foreground, shadow
and background. It should be noted that the emission models can vary between pixel locations.



constants are defined: $\tau_b$, the typical time duration a pixel belongs to the background, and
$\tau_s$, $\tau_f$, the typical durations for shadow and foreground. Let $\lambda_b$, $\lambda_s$, and $\lambda_f$ be the
proportions of the time spent in background, shadow and foreground, with $\lambda_f + \lambda_s + \lambda_b = 1$.
All these parameters are determined empirically. Using these definitions an intuitive
transition matrix can be chosen as

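$$A = \begin{pmatrix} 1 - \tau_b^{-1} & \tau_b^{-1}\lambda_{sf} & \tau_b^{-1}\lambda_{fs} \\ \tau_s^{-1}\lambda_{bf} & 1 - \tau_s^{-1} & \tau_s^{-1}\lambda_{fb} \\ \tau_f^{-1}\lambda_{bs} & \tau_f^{-1}\lambda_{sb} & 1 - \tau_f^{-1} \end{pmatrix}, \tag{2}$$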

where $\lambda_{ij} = \lambda_i / (\lambda_i + \lambda_j)$. The initial state distribution $\pi$ is chosen to be
$\pi = \{\lambda_b, \lambda_s, \lambda_f\}$. The mean of the observation density for the background state $p_b$ can be
estimated to be the mode of the intensities at a given pixel since $\lambda_b \gg \lambda_s$ and $\lambda_b \gg \lambda_f$.
The variance $\sigma_b^2$ is determined empirically. The initial parameters of the observation
density for the shadow region are based on the assumption that the shadow is darker
than the background, i.e.



$$\mu_s = \frac{\mu_b + 2\sigma_b}{2} \qquad \text{and} \qquad \ldots \tag{3}$$



This ensures that $\mu_s < \mu_b$ in case $\mu_b > 2\sigma_b$, i.e. the background intensities are not as low
as intensities in the shadow areas. At each iteration of the Baum Welch algorithm, the
backward and forward variables are rescaled for reasons of numerical stability ['2'J. It is
not necessary to learn a transition probability distribution $A$ for every pixel. By learning
one transition probability distribution for an observation window the complexity of the
learning is reduced considerably. A set of learnt emission models is shown in figure 3.
The corresponding transition probability distribution is of the form



$$A = \begin{pmatrix} 0.986 & 0.012 & 0.001 \\ 0.013 & 0.884 & 0.101 \\ 0.033 & 0.025 & 0.941 \end{pmatrix}. \tag{4}$$



A close inspection of these transition probabilities reveals that during learning dark
cars are mistaken for shadows. As a consequence the expected duration for being in a
foreground state is unrealistically short. For the particular lighting situation (see figure
1) it is possible to solve the problem by adding the constraint $a_{fs} = 0$. This implies
that the transition probability from foreground to shadow should be zero. Of course this
constraint cannot be applied in the general case. It is therefore necessary to find a more
general solution. As a result of the constraint, the parameters of the observation density
for the shadow change. In particular, the variance $\sigma_s$ is now smaller: $\sigma_s = 41.95$ instead of 44.97. The
corresponding transition matrix $A$ is

$$A = \begin{pmatrix} 0.980 & 0.015 & 0.003 \\ 0.013 & 0.897 & 0.891 \\ 0.047 & 0.000 & 0.952 \end{pmatrix}; \tag{5}$$

notice that the value of $a_{ff}$ has increased.
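
The initialisation and the effect of the constraint can be illustrated with a small sketch. Two things here are assumptions rather than statements from the paper: the initial matrix is taken to have the row-stochastic form in which row $i$ has diagonal entry $1 - 1/\tau_i$ and off-diagonal entries $\lambda_{jk}/\tau_i$ (with $k$ the third state), and the numerical values of $\tau$ and $\lambda$ in the usage note are purely illustrative. The expected dwell time $1/(1 - a_{ii})$ is the standard mean of the geometric duration of a Markov chain state. Function names are hypothetical.

```python
STATES = ("b", "s", "f")  # background, shadow, foreground


def initial_transition_matrix(tau, lam):
    """Build an initial transition matrix from time constants and proportions.

    tau[i] is the typical duration of state i (in frames, > 1) and lam[i]
    the proportion of time spent in state i. Assumed form: a_ii = 1 - 1/tau_i
    and a_ij = lam_jk / tau_i with lam_jk = lam_j / (lam_j + lam_k).
    """
    matrix = []
    for i in STATES:
        row = []
        for j in STATES:
            if j == i:
                row.append(1.0 - 1.0 / tau[i])
            else:
                k = next(s for s in STATES if s not in (i, j))
                row.append(lam[j] / (lam[j] + lam[k]) / tau[i])
        matrix.append(row)
    return matrix


def expected_duration(a_ii):
    """Mean number of time steps spent in a state with self-transition a_ii."""
    return 1.0 / (1.0 - a_ii)
```

With the learnt foreground self-transitions quoted above, `expected_duration(0.941)` is about 17 frames while `expected_duration(0.952)` is about 21 frames, which shows how the constraint lengthens the expected foreground duration.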



2.2 Two Observations Improve the Model 

Initial experiments show that by using only one observation, dark cars are not detected
sufficiently well (see figure 4). In order to make the method more robust, it is desirable to
reduce the amount of overlap of the observation densities. In particular it is necessary to 
reduce the ambiguity between dark foreground regions and shadows. These ambiguities 
can be reduced by introducing a second observation. To be precise the responses of two 
different filters will be used. The HMM is no longer modelled for every pixel but for 
sites on a lattice such that the filter supports of the different sites do not overlap. As 
a first observation a simple 3x3 average is used. It can be observed that background 
and shadow regions are more homogeneous than foreground regions. It would therefore 
make sense to introduce a second observation which measures the intensity variation in
a small neighbourhood of each pixel. In order to test this approach a simple 3x3 Sobel
filter mask is used as a second observation. It is possible to show empirically that for this
specific data, the responses of the Sobel filter and the mean intensity response at a pixel
are uncorrelated. Hence the two observations are considered to be independent. The
comparison in figure 4 shows that the use of two observations greatly improves
the detection of dark cars. Whereas the choice of the average filter is justified, the chosen
Sobel filter is by no means optimal. A filter which computes a higher order
derivative of the image data, for example a Laplace filter, or even a spatio-temporal
filter might be a much better alternative.
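
The two observations can be sketched as follows. The lattice spacing of 3 pixels follows from the requirement that the 3x3 filter supports do not overlap. The text mentions a single Sobel mask without giving its orientation, so combining the horizontal and vertical masks into a gradient magnitude is an assumption of this sketch, as are the function names.

```python
import math

SOBEL_X = ((-1, 0, 1), (-2, 0, 2), (-1, 0, 1))
SOBEL_Y = ((-1, -2, -1), (0, 0, 0), (1, 2, 1))


def lattice_sites(height, width):
    """Centres of non-overlapping 3x3 supports inside the image."""
    return [(r, c) for r in range(1, height - 1, 3) for c in range(1, width - 1, 3)]


def observations(image, r, c):
    """Return (3x3 mean intensity, Sobel gradient magnitude) at site (r, c)."""
    patch = [[image[r + dr][c + dc] for dc in (-1, 0, 1)] for dr in (-1, 0, 1)]
    mean = sum(sum(row) for row in patch) / 9.0
    gx = sum(SOBEL_X[i][j] * patch[i][j] for i in range(3) for j in range(3))
    gy = sum(SOBEL_Y[i][j] * patch[i][j] for i in range(3) for j in range(3))
    return mean, math.hypot(gx, gy)  # magnitude is an assumed combination
```

Because the two responses are treated as independent, the joint emission probability of a state is simply the product of its emission probability for the mean response and its emission probability for the Sobel response.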




Fig. 4. Using two observations improves the model. For each time step t every pixel is classified 
to be in a foreground, background, or shadow region. For visualisation purposes the pixels for 
which the forward probability $p(z_t, z_{t-1}, Y_t = f \mid \omega)$ is greater than the forward probability for
the alternative states are marked in black. The image on the left shows the raw data. The black 
box indicates the area in which the model is tested. The two images on the right show the sets 
of pixels which are classified as foreground pixels. It shows that the classification based on two 
observations (right) is superior to the method based on only one measurement (middle). 



2.3 Practical Results 

In order to test the performance of the model, the forward probabilities $p(z_t, z_{t-1}, Y_t \mid \omega)$
are evaluated for the three different states $Y_t \in \{f, b, s\}$ for each time-step t. The discrete
state $Y_t$ for which the forward probability is maximal is taken as a discrete label. By
determining discrete labels this classification method discards information which could
be used by a higher level process, but for now this is sufficient to discuss the
results obtained with the method. Two typical results are shown in figure 5. A movie
which demonstrates the performance of this process can be found in the version of this
paper on our web site (http://www.robots.ox.ac.uk/~vdg). The interior of the car is not
detected perfectly, but there is clearly enough information to detect the boundaries of the
vehicle. In order to illustrate the importance of the state transition probability, the matrix
A was altered by hand. The results are presented in figure 6 and show clearly that the
transition probability plays an important role. The effect is of course most evident when
the discrimination based on measurements alone is ambiguous.
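
The labelling rule described above can be sketched in a few lines. This is only an illustration: the per-pixel dictionary layout and the numerical values are assumptions made for the example and are not taken from the experiments.

```python
def label_pixels(forward_probs):
    """Assign each pixel the state with the largest forward probability.

    forward_probs: one dict per pixel mapping a state name ('f', 'b' or 's')
    to its forward probability (hypothetical data layout).
    """
    return [max(p, key=p.get) for p in forward_probs]


# Toy values chosen for illustration only.
probs = [
    {"f": 0.70, "b": 0.20, "s": 0.10},
    {"f": 0.05, "b": 0.80, "s": 0.15},
    {"f": 0.10, "b": 0.30, "s": 0.60},
]
labels = label_pixels(probs)
```

Keeping the full dictionaries rather than only the labels would preserve the information that the hard labelling discards.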

Fig. 5. Results of the background modelling. The discrete label $Y_t$ for which the forward
probability $p(z_t, z_{t-1}, Y_t \mid \omega)$ is maximal is used as a discrete label for visualisation (see text).
Foreground pixels are marked in black, shadow pixels in grey, and background pixels in white. It
should be noted that even for dark coloured cars the results are respectable. The labels will then
be used by a higher level process to locate the vehicles.

Fig. 6. Importance of the temporal continuity constraint. As in figure 5, each pixel is
assigned the discrete label $Y_t$ for which the forward probability $p(z_t, z_{t-1}, Y_t \mid \omega)$ is maximal. In this
experiment the transition probability of a model which uses two observations was altered such
that all $a_{ij} = 1/3$ in order to explore the importance of the temporal continuity constraint. Each
pixel is classified (see text) as foreground (in black), shadow (in grey) or background (white). A
comparison with the images shown in figure 5 shows that these results are clearly worse. Obviously
the transition probability A plays a crucial role.

3 The Car Tracker

The remaining challenge is to build a robust car tracker. Probabilistic trackers based on
particle filters are known to be robust and can be extended to tracking multiple
objects. In order to build such a tracker it is necessary to model the observation
likelihood



$$p(Z \mid X; \vartheta) \qquad (6)$$



for a set of measurements Z and a hypothesis X. The parameters of the model are
denoted by $\vartheta$. For the present purpose it is sufficient to model the outlines of the cars as
a perspectively distorted rectangle which will be parameterised by the state vector X.
In order to track cars robustly it is not sufficient to take edge measurements as in previous work.
It has been shown that detection of the background aids finding the foreground object. The
problem is that in this case the measurements Z cannot be assumed to be independent.
These conditions lead us to model the likelihood as a conditioned
Markov random field (MRF). In Gibbs form an MRF can be
written as



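$$p(Z \mid X; \vartheta) = \frac{\exp(-H^{\vartheta}(Z, X))}{\sum_{Z \in \mathcal{Z}} \exp(-H^{\vartheta}(Z, X))} \qquad (7)$$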



The denominator of the fraction is known as the partition function of the MRF. The 
difficulty is now to find a model which is tractable yet still captures the spatial dependence 
of neighbouring measurements. 
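
To see why the partition function is the bottleneck, the Gibbs form can be evaluated by brute force on a very small chain. The sketch below is an illustration under stated assumptions: the pairwise energy, the three grey levels and the chain length are hypothetical choices, not the model used in the experiments.

```python
import itertools
import math


def energy(z, theta):
    """Hypothetical pairwise energy on a 1-D chain: theta times the sum of
    squared differences between neighbouring measurements."""
    return theta * sum((a - b) ** 2 for a, b in zip(z, z[1:]))


def gibbs_probability(z, theta, levels, n_sites):
    """exp(-H(z)) divided by the partition function, summed by brute force."""
    partition = sum(
        math.exp(-energy(cfg, theta))
        for cfg in itertools.product(levels, repeat=n_sites)
    )
    return math.exp(-energy(z, theta)) / partition


levels = (0, 1, 2)
total = sum(
    gibbs_probability(cfg, 0.5, levels, 3)
    for cfg in itertools.product(levels, repeat=3)
)
```

The sum in the denominator runs over `len(levels) ** n_sites` configurations, which is why a direct evaluation is hopeless for a realistic number of sites and grey levels.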



[Figure 7 panels, left to right: 2-dim. lattice; vertical scan-lines; horizontal scan-lines]

Fig. 7. Neighbourhood structure of the MRF. The set of sites on a lattice S is marked by circles.
The neighbourhood structure at one particular site s (marked as a filled black circle) is different
in each case. The neighbours $r \in \delta(s)$ of the site s are marked by black circles which are filled
grey. The set of cliques are indicated by lines connecting neighbouring sites.




3.1 Modelling the Observation Likelihood 

As mentioned in the previous section, one difficult problem is to find an energy function
H for which the likelihood $p(Z \mid X; \vartheta)$ can be evaluated efficiently. The energy function
H will depend on a lattice S and a corresponding neighbourhood system $\delta := \{\delta(s) : s \in S\}$
(see figure 7). The set of cliques will be denoted by C. In order to take the distribution
of the measurement $z_s$ at a given site and the statistical dependence of measurements at
neighbouring sites into account we let the energy function be

$$H^{\vartheta}_{\lambda}(Z, X) = \sum_{s \in \Lambda_\lambda} g_\lambda(z_s) + \sum_{(s,r) \in C \cap \Lambda_\lambda^2} \vartheta_\lambda (z_s - z_r)^2 , \qquad (8)$$

where $\Lambda_\lambda$ denotes an area which is either in the foreground or background, i.e. $\lambda \in \{B, F\}$.
Since the function $g_\lambda$ models the distribution of the measurement at a given
site it would be ideal if one could make use of the emission models which were learnt
for the different states of the HMM (see section 2). But as will be shown later, the
energy function needs to be translation invariant (13), so $g_\lambda$ cannot depend
on a particular site s. And in order to compute the partition function efficiently (section
3.3) it is necessary that the functions $g_F$ and $g_B$ are normal distributions. The foreground
distribution $g_F$ is therefore chosen to be a normal distribution with a large variance. The
background distribution $g_B$ is taken to be the normal with mean $\mu_B$ and variance $\sigma_B$ such
that it approximates the mixture of the background and shadow emission models
learnt by the HMM.

The set of sites which belong to a given area $\Lambda_\lambda$ depends of course on the hypothesis
X. Because the partition function also depends on X it will be necessary to evaluate it
for every hypothesis X. It turns out that if the lattice S is two dimensional, the partition
function is too expensive to compute. In the following it is explained why it is not possible
to approximate the observation likelihood. It is therefore necessary to find a simpler
model. It is known that under certain conditions the pseudolikelihood function,
defined as

$$PL(Z; \vartheta) = \prod_{s \in S} p(z_s \mid z_{\delta(s)}; \vartheta) , \qquad (9)$$

can be used for parameter estimation instead of the maximum likelihood approach
based on the MRF (7). It can be shown that estimators obtained by maximizing the
pseudolikelihood can compete in terms of statistical properties with maximum likelihood
estimators. Although some authors state that the pseudolikelihood is a good approximation
to the likelihood when the variables are weakly correlated, it seems to be an open
problem under which conditions precisely it can be used as an approximation to the
likelihood function. In section 3.2 it will also become clear why the pseudolikelihood
method cannot be used to estimate X. An alternative is to restrict the MRF to
measurements on scan lines taken out of the image. This will simplify the model considerably.
The observation likelihoods of the different scan lines will be treated as independent.
Based on the grid in figure 11 it is possible to formulate a random field for each of the
horizontal $\{h_i\}$ and vertical lines $\{v_i\}$. The likelihood is now of the following form:



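$$p(Z \mid X; \vartheta) = \frac{\exp(-(H^{\vartheta}_B + H^{\vartheta}_F)(Z, X))}{\sum_{Z \in \mathcal{Z}} \exp(-(H^{\vartheta}_B + H^{\vartheta}_F)(Z, X))} , \qquad (10)$$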



where $\{Z_i\}$ is the set of lines on the grid. The energies $H_B$ and $H_F$ are defined as in (8)
except that the neighbourhood system has changed (see figure 7). The partition function
for the set of lines can be written as

$$\sum_{Z \in \mathcal{Z}} \exp(-(H^{\vartheta}_B + H^{\vartheta}_F)(Z, X)) = \prod_i \sum_{Z_i \in \mathcal{Z}_i} \exp(-(H^{\vartheta}_B + H^{\vartheta}_F)(Z_i, X)) , \qquad (11)$$

where for every $i \neq j$ one has $Z_i \cap Z_j = \emptyset$, so Z is the union of mutually disjoint sets $Z_i$.
Therefore it is now possible to compute the partition function because it only depends
on line segments which are entirely in the foreground or background.
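
The factorisation over disjoint scan lines can be checked numerically on a toy problem. The following sketch assumes a hypothetical quadratic interaction energy per line, three grey levels and two short lines; it only illustrates that the joint sum equals the product of per-line sums when lines do not interact.

```python
import itertools
import math


def line_energy(z, theta):
    """Hypothetical energy of one scan line: squared neighbour differences."""
    return theta * sum((a - b) ** 2 for a, b in zip(z, z[1:]))


def line_partition(n_sites, theta, levels):
    """Partition function of a single line, by brute force."""
    return sum(
        math.exp(-line_energy(cfg, theta))
        for cfg in itertools.product(levels, repeat=n_sites)
    )


levels = (0, 1, 2)
lengths = (2, 3)  # two disjoint scan lines
theta = 0.4

# Joint sum over all configurations of both lines at once.
joint = 0.0
for cfg in itertools.product(levels, repeat=sum(lengths)):
    first, second = cfg[:lengths[0]], cfg[lengths[0]:]
    joint += math.exp(-(line_energy(first, theta) + line_energy(second, theta)))

# Product of the per-line partition functions.
product = math.prod(line_partition(n, theta, levels) for n in lengths)
```

The joint sum visits 3^5 configurations, whereas the product only needs 3^2 + 3^3; the gap grows exponentially with the number of lines.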
3.2 Learning the Parameters of the Random Field

Learning the model parameters by a maximum likelihood method is computationally
expensive. As mentioned above, maximising the pseudolikelihood (9) with
respect to $\vartheta$ leads to an effective estimator for $\vartheta$. For reasons which will be apparent
later we consider the pseudolikelihood for an observation window $T \subset S$ which is entirely
in the foreground or background. That implies that the conditioning on the hypothesis X
can be ignored for this analysis. The energy function of $p(z_s \mid z_{\delta(s)}; \vartheta)$ is in this case equal
to the neighbourhood potential. The logarithm of the pseudolikelihood for an observation
window $T \subset S$ has the form



$$PL_T(Z; \vartheta) = \sum_{s \in T} \Big[ g(z_s) + \vartheta V_s(z_s z_{\delta(s)}) - \ln \sum_{z_s} \exp\big(\vartheta V_s(z_s z_{\delta(s)})\big) \Big] \qquad (12)$$



where V is defined as $V_s := \sum_{r \in \delta(s)} (z_s - z_r)^2$. The neighbourhood potential must
satisfy a special spatial homogeneity condition. The potential is shift or translation
invariant if for all $s, t, u \in S$

$$t \in \delta(s) \iff t + u \in \delta(s + u) \quad \text{and} \quad V_{c+u}(z_{s-u}) = V_c(z_s) . \qquad (13)$$

Furthermore a parameter $\vartheta$ is said to be identifiable if for every $\vartheta' \in \Theta$, $\vartheta' \neq \vartheta$, there is a
configuration Z such that

$$p(Z; \vartheta) \neq p(Z; \vartheta') . \qquad (14)$$

The maximum pseudolikelihood estimator for the observation window T maximises
$PL_T(Z, \cdot)$. If the potential is translation invariant and the parameter $\vartheta$ is identifiable,
Winkler (Theorem 14.3.1 on page 240) proves that this estimator is asymptotically
consistent when the size of the observation window increases. Winkler also proves
that the log pseudolikelihood $PL_T$ is concave. In the present setting it is of course
necessary to learn the parameters for the foreground and background energies $H_F$ and
$H_B$ separately. Since $PL_T$ is concave it is possible to use a standard gradient ascent
algorithm to find the maximum of the log pseudolikelihood. In order to compute the
gradient of the log pseudolikelihood it is desirable that the potential only depends on the
parameters linearly. The gradient of the log pseudolikelihood can be written as

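$$\nabla PL_T(Z; \vartheta) = \sum_{s \in T} \big[ V(z_s z_{\delta(s)}) - E(V(Z_s z_{\delta(s)}) \mid z_{\delta(s)}; \vartheta) \big] , \qquad (15)$$

where $E(V(Z_s z_{\delta(s)}))$ denotes the conditional expectation with respect to the distribution
$p(z_s \mid z_{\delta(s)}; \vartheta)$ on $Z_s$. The graphs of the pseudolikelihood can be found in figure 8.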
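
A gradient of the form (15) can be checked against finite differences on a toy chain. The sketch below makes several assumptions that are not taken from the paper: a one dimensional chain with four discrete grey levels, the term $g$ left out, and the convention that the conditional distribution is proportional to $\exp(\vartheta V_s)$, so that the gradient is "observed potential minus its conditional expectation". The step size and iteration count are arbitrary.

```python
import math

LEVELS = range(4)  # hypothetical grey levels


def potential(value, neighbours):
    """V_s: sum of squared differences to the neighbouring measurements."""
    return sum((value - r) ** 2 for r in neighbours)


def neighbours_of(z, s):
    """Measurements at the left and right neighbours of site s on a chain."""
    return [z[r] for r in (s - 1, s + 1) if 0 <= r < len(z)]


def log_pl(z, theta):
    """Log pseudolikelihood of the chain z for parameter theta."""
    total = 0.0
    for s, value in enumerate(z):
        nb = neighbours_of(z, s)
        log_norm = math.log(sum(math.exp(theta * potential(k, nb)) for k in LEVELS))
        total += theta * potential(value, nb) - log_norm
    return total


def grad_log_pl(z, theta):
    """Sum over sites of V(observed) minus its conditional expectation."""
    grad = 0.0
    for s, value in enumerate(z):
        nb = neighbours_of(z, s)
        weights = [math.exp(theta * potential(k, nb)) for k in LEVELS]
        expected = sum(w * potential(k, nb) for w, k in zip(weights, LEVELS)) / sum(weights)
        grad += potential(value, nb) - expected
    return grad


def fit_theta(z, steps=2000, rate=0.001, theta=0.0):
    """Plain gradient ascent on the concave log pseudolikelihood."""
    for _ in range(steps):
        theta += rate * grad_log_pl(z, theta)
    return theta


z = [0, 1, 1, 3, 2, 2, 0, 1]
theta_hat = fit_theta(z)
```

Under this sign convention a smooth signal yields a negative estimate, since large squared differences must be made unlikely.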

3.3 Computing the Partition Function 

[Figure 8 panels: Background (horizontal, vertical); Foreground (horizontal, vertical)]

Fig. 8. Pseudolikelihood of training data. The pseudolikelihood is plotted for different
values of $\vartheta$. The distance between neighbouring sites d is set to d = 4 for the horizontal and d = 2
for the vertical lines. Because we work on fields, d differs for horizontal and vertical lines. It should
be noted that there is a difference between the models. The functions are concave, as expected.

The main reason for adopting a one dimensional model was the problem of computing the
partition function of the observation likelihood. Due to equation (11) it is possible
to compute the partition function by precomputing

$$B_N := \sum_{Z} \exp(-H_B(Z)) \quad \text{and} \quad F_N := \sum_{Z} \exp(-H_F(Z)) , \qquad (16)$$

where the vector of measurements Z has length N. Rather than computing the value of the
partition function for a particular hypothesis X it is desirable to compute a factor a(X)
such that

$$\sum_{Z \in \mathcal{Z}} \exp(-(H^{\vartheta}_B + H^{\vartheta}_F)(Z, X)) = a(X)\, C , \qquad (17)$$

where C is some constant. Now the problem of computing $B_N$ and $F_N$ needs to be
addressed. The energy functions can be written as a quadratic form, i.e. $H_B(Z) = Z^T M Z$.
The matrix M is of the form



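$$M = \begin{pmatrix}
(\lambda + \vartheta) & -\vartheta & 0 & \cdots & 0 \\
-\vartheta & (\lambda + 2\vartheta) & -\vartheta & \cdots & \vdots \\
0 & \ddots & \ddots & \ddots & 0 \\
\vdots & \cdots & -\vartheta & (\lambda + 2\vartheta) & -\vartheta \\
0 & \cdots & 0 & -\vartheta & (\lambda + \vartheta)
\end{pmatrix} \qquad (18)$$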



The matrix M is symmetric so it is possible to approximate $B_N$ as

$$B_N = \sum_{Z} \exp(-Z^T M Z) \approx \int \exp(-Z^T M Z)\, dZ = (2\pi)^{N/2} \det(M)^{-1/2} . \qquad (19)$$

Since $g_F$ and $g_B$ are normal distributions this approximation holds for $B_N$ as well as $F_N$.
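
The closed form on the right of (19) is cheap because the determinant of a tridiagonal matrix follows from a three-term recurrence. The sketch below is illustrative only and rests on two assumptions: the matrix has exactly the structure of (18), and the exponent carries a factor 1/2, i.e. $\exp(-\tfrac{1}{2} Z^T M Z)$, which is the convention under which the Gaussian integral equals $(2\pi)^{N/2} \det(M)^{-1/2}$. The function names are hypothetical.

```python
import math


def tridiagonal_det(lam, theta, n):
    """Determinant of the n x n matrix with the structure of (18):
    diagonal lam + 2*theta (lam + theta at both ends), off-diagonal -theta."""
    diag = [lam + 2 * theta] * n
    diag[0] = diag[-1] = lam + theta
    prev, cur = 1.0, diag[0]
    for k in range(1, n):
        # Three-term recurrence for a symmetric tridiagonal matrix.
        prev, cur = cur, diag[k] * cur - theta ** 2 * prev
    return cur


def gaussian_partition(lam, theta, n):
    """(2*pi)^(n/2) * det(M)^(-1/2), valid for the exponent -0.5 * z^T M z."""
    return (2 * math.pi) ** (n / 2) / math.sqrt(tridiagonal_det(lam, theta, n))


# One site, no interaction: compare the discrete sum over integer grey
# values with the continuous Gaussian integral.
lam = 0.05
discrete = sum(math.exp(-0.5 * lam * z * z) for z in range(-200, 201))
continuous = gaussian_partition(lam, 0.0, 1)
```

For a broad distribution such as this one the discrete sum and the integral agree closely; the approximation degrades when the distribution is narrow compared with the grey-level spacing.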



3.4 Results 

[Figure 9 panels: horizontal translation and scaling, each plotting ln p(Z|X) for the independent model and for the learnt MRF]

Fig. 9. Log-likelihood for horizontal translation and scaling. The horizontal translation and
scaling of the shape template is illustrated in figure 11. For both the horizontal translation and the
scaling the log-likelihood is shown for the independent model ($\vartheta = 0$) (left) and the MRF with the learnt
parameter $\vartheta$ (see figure 8). The parameters for the intensity distributions $g_F$ and $g_B$ are
$\sigma_B = 25$, $\mu_B = 102$, $\sigma_F = 600$, $\mu_F = 128$. The results obtained for the scaling clearly need to be
improved. See text for discussion.

The observation likelihood p(Z|X) as defined in (10) was tested on a set of single images.
The results are summarised in figure 9. Whereas the results for horizontal and vertical
translation are good, the results obtained for the scaling of the foreground window are
poor. In order to test whether the MRF has any effect, $\vartheta_B$ and $\vartheta_F$ are set to zero, which
is equivalent to assuming that two neighbouring measurements are independent. The
graphs in figure 9 show that modelling the statistical dependence of neighbouring
measurements using the MRF does have an effect. As a first step to improve the model the
neighbourhood structure was changed, hoping that the interaction terms $V_s$ (12) would
have a greater effect. Now every pixel location on a scan line is a site for the MRF. The
resulting energy function is

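$$H^{\vartheta}_{\lambda}(Z, X) = \sum_{s \in \Lambda_\lambda} g_\lambda(z_s) + \sum_{\delta(s) \subset \Lambda_\lambda} \vartheta_\lambda (z_s - z_{s+d})^2 . \qquad (20)$$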

Only the distance between neighbours depends on a predefined spacing d. The results of
this improved method are shown in figures 10 and 11. The fact that the results obtained
with the new observation likelihood (20) are better shows that the MRF is very sensitive
to the chosen neighbourhood structure. This raises the question whether there is any way to
determine an optimal neighbourhood structure automatically. The hand-picked MRF we
chose might not be the best after all.

A more ambitious step would be to construct an observation likelihood which makes
use of the forward probabilities $p(z_t, z_{t-1}, Y_t = f \mid \omega)$. This would complicate the
computation of the partition function. But based on the encouraging results we obtained
from the HMM (see figure 5) this could lead to a far more powerful model. It can be
concluded that the MRF has the desired effect but needs to be improved before it can be used
in a tracker.



Fig. 10. Log-likelihoods for the improved model. Similar to figure 9, the log-likelihoods are
shown for horizontal and vertical translation as well as scaling using the improved model defined
in (20). The model parameters themselves are chosen as in figure 9. Although the maximum for the
horizontal translation is not at zero, figure 11 demonstrates that the most likely hypothesis leads
to a correct localisation.

4 Conclusion

Both a new probabilistic background model and an observation likelihood for
tracking cars are presented. Although the background model is particularly suited to the
traffic surveillance problem it can be used for a wide range of application domains. The
results presented in figure 5 show that the use of this background model could lead to a
robust tracker. The observation likelihood itself however still needs to be improved. The
contribution this paper makes can be summarised as follows.

Probabilistic background model. Unlike many other background models the model
presented here is capable of modelling shadow as well as foreground and background
regions. Another considerable advantage of this model is that it is no longer necessary
to select the training data. HMMs are a suitable model for this problem as they impose
temporal continuity constraints. Although using two observations did improve the results
significantly, the choice of filters is not optimal. The results presented in figure 6 support
the claim that it is crucial to model the transition probabilities correctly.

Car tracker. In order to build a robust car tracker it is necessary to model the inside of the
vehicles as well as the background and the statistical dependence of neighbouring pixels.
This is possible by modelling an observation density used in a particle filter which is
based on an MRF. However it has to be noted that the MRF is very sensitive to the choice
of the neighbourhood system. It remains an open problem which neighbourhood system
is optimal. The formulation of the MRF based on scan-lines leads to a model which is
computationally tractable. It should be noted that the presented observation likelihood
is consistent with a Bayesian framework since the measurements do not depend on the
hypothesised position of the vehicle. The use of importance sampling makes it possible
to feed the information of the low level process into the car tracker in a consistent fashion.

Future work. Since the illumination changes throughout the day it is necessary to derive
a criterion for when the parameters of the background model need to be updated. It is
furthermore necessary to investigate how the observation density can be improved.

Acknowledgements. We are grateful for the support of the EPSRC and the Royal Society
(AB) and the EU (JR).

Fig. 11. Observation window and scan lines of the car tracker. The right image illustrates the
grid used by the algorithm. The observation window is marked in black. The measurements are
taken on scan-lines (in white). The hypothesised position of the car is shown in dark grey. The
other two images illustrate how well the improved model localises. The most likely hypothesis is
shown as a solid black line. The dashed lines illustrate the minimal and maximal configurations
of the variation. See figures 9 and 10 for the corresponding log-likelihood functions.

Abstract. An experimental vehicle is being developed for the purposes
of precise crop treatment, with the aim of reducing chemical use and
thereby improving quality and reducing both costs and environmental
contamination. For differential treatment of crop and weed, the vehicle
must discriminate between crop, weed and soil. We present a two stage
algorithm designed for this purpose, and use this algorithm to illustrate
how empirical discrepancy methods, notably the analysis of type I
and type II statistical errors and receiver operating characteristic curves,
may be used to compare algorithm performance over a set of test images
which represent typical working conditions for the vehicle. Analysis
of performance is presented for the two stages of the algorithm separately,
and also for the combined algorithm. This analysis allows us to
understand the effects of various types of misclassification error on the
overall algorithm performance, and as such is a valuable methodology
for computer vision engineers.



1 Introduction 

Economic and ecological pressures have led to a demand for reduced use of
chemical applicants in agricultural operations such as crop and weed treatment. The
discipline of precision agriculture strives to reduce the use of agro-chemicals by
directing them more accurately and appropriately. The extreme interpretation
of this approach is plant scale husbandry, where the aim is to treat individual
plants according to their particular needs. An experimental horticultural vehicle
has been developed to investigate the viability of plant scale husbandry, and
previous work has described a tracking algorithm, centred upon an
extended Kalman filter, that allows navigation of the vehicle along the rows
of crop in the field. This paper presents a simple algorithm for frame-rate
segmentation of images for the task of differential plant treatment, together with
a thorough evaluation of algorithm performance on data captured from the
vehicle. The algorithm comprises two stages. Stage I aims to extract image features
which represent plant matter from the soil background, and stage II divides these
features into crop and weed classes for treatment scheduling.

The practical application of the algorithm requires that we understand how
its performance varies in different operating conditions; Haralick underlines
the necessity of the evaluation of computer vision algorithms if the field is to
produce methods of practical use to engineers. In this paper, we evaluate the
two stages of the algorithm separately and, as a result, we are able to gain deeper
insight into the performance of the algorithm as a whole. A review of techniques
for image segmentation evaluation is presented by Zhang, who partitions the
methods into three categories: analytical, where performance is judged on the
basis of its principles, complexity, requirements and so forth; empirical goodness
methods, which compute some manner of “goodness” function such as uniformity
within regions, contrast between regions, or shape of segmented regions; and
finally, empirical discrepancy methods, which compare properties of the segmented
image with some ground truth segmentation and compute error measures.
Analytic methods may only be useful for simple algorithms or straightforward
segmentation problems, and the researcher needs to be confident of the models
on which these processes are based if they are to trust the analysis. Empirical
goodness methods have the advantage that they do not force the researcher to
perform the onerous task of producing ground truth data for comparison with
the segmentation. However, for meaningful results an appropriate model of “goodness”
is required, and in most practical problems, if such a model were available, it
should be used as part of the algorithm itself. This leaves empirical discrepancy
methods, which compare algorithmic output with ground truth segmentation of
the test data and quantify the levels of agreement and/or disagreement.

A discrepancy method which is suitable for two-class segmentation problems
is receiver operating characteristic (ROC) curve analysis. Rooted in psychophysics
and signal detection theory, ROC analysis has proved popular for the
comparison of diagnostic techniques in medicine, and is gradually gaining
currency within the computer vision and image analysis community for the
comparative evaluation of algorithms such as colour models [2], edge detectors
and appearance identification. Receiver operating characteristic curves typically
plot true positive rates against false positive rates as a decision parameter
is varied and provide a means of algorithm comparison and evaluation. ROC
analysis also allows selection of an operating point which yields the minimum
possible Bayes risk. ROC curves will be discussed further below, together
with the related maximum realisable ROC (MRROC) curve. Although our
algorithm produces a three-way final classification (crop, weed and soil), stages
I and II are both binary classifiers, so we can analyse their performance using
ROC methods.

We will first outline the segmentation algorithm prior to discussing evaluation 
of the performance of stages I and II. Final results for the complete algorithm 
are then presented and discussed in the light of our knowledge of its constituent 
parts. 

2 The Segmentation Algorithm 

The two stage segmentation algorithm is sketched in the following sections; for
the sake of brevity, details of the algorithms are not given (these may be found
elsewhere), but sufficient information is provided to allow the performance
evaluation sections to be understood.



2.1 Stage I: Plant Matter Extraction 

The experimental vehicle is equipped with a monochrome camera that is fitted
with a filter which blocks visible light, but allows near infra-red wavelengths to
pass. Many researchers, including for example Biller, have noted that the
contrast between soil and plant matter is greater in the near infra-red wavelengths
than the visible, and this allows us to use a grey level threshold to extract
pixels which represent plant matter from the images captured by the vehicle as
it traverses the field. We use an adaptive interpolating threshold algorithm, to
allow for the fact that there is often a brightness gradient across many of the
images captured by the vehicle. The cause of such a gradient is most likely the
position of the Sun relative to the ground plane and the vehicle’s camera, and
the interaction of the illuminant with the rough surface of the soil. A simple
linear variation in intensity between the upper and lower parts of the image is
used to allow for such effects. Accurate modelling of illumination and reflectance
effects is a complex issue and not of direct concern to this work. More principled
models are known for surface reflectance, such as those due to van Branniken
et al. or Oren and Nayar.

The algorithm is also adaptive to the average brightness of the image, which
offers some robustness to changes in illumination as, for example, when the Sun
is temporarily masked by a cloud. A mean grey-level is computed for both the
top ($\mu_1$) and bottom ($\mu_2$) halves of the image and these two means are used
as fixed points to linearly interpolate a mean $\mu(y_f)$ across the vertical pixel
co-ordinates of the image. The classification of output pixels $O(x_f, y_f)$ is then given
by the adaptive interpolating thresholding algorithm:

$$O(x_f, y_f) = \begin{cases} P & \text{if } I(x_f, y_f) \ge \alpha \mu(y_f) \\ S & \text{if } I(x_f, y_f) < \alpha \mu(y_f) \end{cases} , \qquad (1)$$



where P denotes plant matter (crop or weed) and S soil. The decision rule of
equation (1) is used in a chain-code clustering algorithm whereby groups of
neighbouring above-threshold pixels are clustered into “blobs”. Each blob is
described by the pixel co-ordinates of its centroid in the image, and its size in
number of pixels. The process is illustrated in figure 1, which shows an image and
the plant matter extracted from it automatically. It can be seen from the figure
that some of the plants fracture into multiple blobs. This is largely caused by
shadows falling between plant leaves which lead to areas of the plant in the image
that lie below the chosen threshold. Another problem is that neighbouring
plants sometimes become merged into a single feature. Whilst
there is little that can be done about the latter problem, the feature clustering
technique in stage II of the algorithm aims to address difficulties caused by plant
features fracturing.
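
The thresholding step of stage I can be sketched as follows. This is a minimal illustration, not the implementation used on the vehicle: the image is a plain list of rows, and the two half-image means are assumed to be anchored at the centre rows of their halves, a detail the text does not specify. The blob clustering step is not shown.

```python
def mean(values):
    return sum(values) / len(values)


def adaptive_threshold(image, alpha):
    """Label each pixel 'P' (plant matter) or 'S' (soil).

    image: list of rows of grey levels, top row first (at least two rows).
    alpha: gain applied to the interpolated mean grey level.
    """
    height = len(image)
    half = height // 2
    mu1 = mean([v for row in image[:half] for v in row])  # top half
    mu2 = mean([v for row in image[half:] for v in row])  # bottom half
    # Assumption: each mean is anchored at the centre row of its half.
    y1 = (half - 1) / 2
    y2 = half + (height - half - 1) / 2
    labels = []
    for y, row in enumerate(image):
        # Linear interpolation (and extrapolation) of the mean along y.
        mu_y = mu1 + (mu2 - mu1) * (y - y1) / (y2 - y1)
        labels.append(["P" if v >= alpha * mu_y else "S" for v in row])
    return labels


# Toy image with a brightness gradient and one bright pixel per half.
image = [
    [10, 10, 50],
    [10, 10, 10],
    [20, 20, 90],
    [20, 20, 20],
]
labels = adaptive_threshold(image, alpha=1.5)
```

Because the threshold follows the interpolated mean, the bright pixel in the darker top half is detected with the same gain as the one in the brighter bottom half.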




Fig. 1. An image and its automatically extracted plant matter. 



2.2 Stage II: Crop/weed Discrimination 



The image on the left of figure 1 shows that the crop plants grow in a fairly
regular pattern in the field, and also that they are generally larger than the
weeds. These are the two pieces of information that we exploit in the second
stage of the segmentation algorithm, which aims to separate the set of plant
matter features (denoted P) into subsets of crop (C) and weed (W). The first
step of this stage is to filter the plant matter features on the basis of their
size in the image. Justification for this decision is provided by figure 2, where
histograms of the feature sizes (in pixels/feature) are plotted for both weed and
crop. This data is derived from manually segmented images that we use as our
ground truth data throughout this paper. More details of this data are given
below.

It can be seen from the histograms that the vast majority (in fact 95%) of
the weed blobs have a size of less than 50 pixels, whilst most (90%) of the crop
blobs have a size of 50 pixels or greater. This supports the claim that the
weeds are typically smaller than the crop.

Thus, we have a straightforward algorithm that places a threshold on the 
size s of the image features. This may be expressed as follows: 



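$$\text{Class}(\text{feature}) = \begin{cases} W & \text{if } s(\text{feature}) < \varsigma \\ P & \text{if } s(\text{feature}) \ge \varsigma \end{cases} , \qquad (2)$$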




On the Performance Characterisation of Image Segmentation Algorithms 355 




where s(feature) is the size of an image feature in pixels, and $\varsigma$ is the size
threshold.

[Figure 2 panels: two histograms of blob counts against blob size (pixels), with bins from 0 to 50]

Fig. 2. Blob size histograms. Left: weed blobs. Right: crop blobs. In both histograms,
the right-most bin (marked 50) counts all blobs of size ≥ 50. Note that in the crop feature
histogram most of the bins are empty, except for the right-most.
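
The size filter amounts to a single comparison per blob. In the sketch below the function name is hypothetical and the threshold of 50 pixels is borrowed from the histogram discussion purely as an example value; it is not stated to be the operating point.

```python
def size_filter(blob_sizes, threshold):
    """Return 'W' (weed) for blobs below the size threshold and 'P'
    (plant matter passed on to the next step) otherwise."""
    return ["W" if size < threshold else "P" for size in blob_sizes]


labels = size_filter([3, 49, 50, 212], threshold=50)
```

Blobs labelled 'P' are not yet declared crop; they are handed to the second step of stage II.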

The second step of stage II of the algorithm makes use of the regular grid
pattern formed by the crop as they are planted in the field. The grid pattern
is used as a cue for vehicle navigation, where the position of the vehicle
relative to the crop grid and the dimensions of the grid are estimated by an
extended Kalman filter (EKF) [3]. The EKF also produces a covariance matrix
that describes the level of confidence in the current estimate. The state estimate
is used to predict the position of each plant within the grid, and an algorithm
akin to a validation gate is used to cluster all plant matter features within
a certain radius of the predicted crop plant position.

The validation gate has proved to be effective as an outlier rejection
mechanism in practical Kalman filtering applications. The algorithm combines
the uncertainty on the predicted feature position and the uncertainty attached
to the observed data to define a validation region outside of which candidate
feature matches are rejected as being outliers. In our algorithm, we take the
uncertainty of the estimated plant position and combine it with a user defined
region which describes a radius on the ground plane about the plant centroid
within which all of the crop plant should lie. This defines an association region in
the image inside of which all plant matter features are labelled as crop (C), and
outside of which the features are labelled as weed (W). The schematic diagram
in figure 3 illustrates the components of the association region. Full details of
the algorithm can be found elsewhere. The size of the region which describes
the user-defined plant radius is controlled by a single parameter r, the radius on
the ground plane within which the crop plant matter should lie. This model
implicitly assumes a distribution for the weed matter that gives lower probability
of weed occurrence than crop plant occurrence within the radius r.
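
A much simplified version of the association step can be sketched as below. It is an illustration under strong assumptions: the association region is replaced by a plain circle in image co-ordinates, whereas the method described above combines the EKF position uncertainty with the radius r projected from the ground plane. The function and argument names are hypothetical.

```python
import math


def classify_features(features, predicted_positions, radius):
    """Label a feature 'C' (crop) if its centroid lies within `radius` of
    any predicted crop plant position, and 'W' (weed) otherwise."""
    labels = []
    for fx, fy in features:
        inside = any(
            math.hypot(fx - px, fy - py) <= radius
            for px, py in predicted_positions
        )
        labels.append("C" if inside else "W")
    return labels


features = [(10, 10), (12, 11), (40, 40)]   # blob centroids (toy values)
predicted = [(11, 10)]                      # predicted plant position
labels = classify_features(features, predicted, radius=3)
```

The two nearby blobs are both attached to the same plant, which is how fractured plant features are brought back together.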



Fig. 3. The construction of the association region.

3 Evaluation Using ROC Curves

The receiver operating characteristic (ROC) curve supports the analysis of
binary classification algorithms whose performance is controlled by a single
parameter. For each parameter setting, algorithmic output is compared with ground
truth data, and four numbers are calculated: TP, the number of “positive”
cases correctly classified as positive; TN, the number of “negative” cases correctly
classified as negative; FN, the number of positive cases incorrectly classified as
negative; and FP, the number of negative cases incorrectly classified as positive.
In the statistical literature, FN cases are type I errors, and FP cases type II
errors. From these four figures, two independent quantities are constructed,
the true positive ratio, TPR = TP/(TP+FN), and the false positive ratio,
FPR = FP/(FP+TN). To construct an ROC curve, a set of algorithm parameter
values is chosen, and for each of these, the TPR and FPR values are calculated
and plotted against each other. The set of (TPR, FPR) pairs forms the ROC curve.
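
The construction of ROC points can be sketched as follows. The example assumes a generic classifier that declares a case positive when a score reaches a threshold; the scores and ground truth are toy values, and the function names are hypothetical.

```python
def rates(predicted, truth):
    """Return (TPR, FPR) for two equal-length sequences of booleans."""
    tp = sum(p and t for p, t in zip(predicted, truth))
    fn = sum((not p) and t for p, t in zip(predicted, truth))
    fp = sum(p and (not t) for p, t in zip(predicted, truth))
    tn = sum((not p) and (not t) for p, t in zip(predicted, truth))
    return tp / (tp + fn), fp / (fp + tn)


def roc_points(scores, truth, thresholds):
    """One (FPR, TPR) point per threshold; positive if score >= threshold."""
    points = []
    for th in thresholds:
        tpr, fpr = rates([s >= th for s in scores], truth)
        points.append((fpr, tpr))
    return points


scores = [0.9, 0.8, 0.6, 0.3, 0.1]
truth = [True, True, False, True, False]
points = roc_points(scores, truth, thresholds=[0.5])
```

Points are stored as (FPR, TPR) so that they can be plotted directly with the false positive ratio on the horizontal axis.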

To characterise the performance of our algorithms, we shall use the area
underneath the ROC curve. This metric has often been used to compare the
performance of different algorithms across the same data sets, but we will
use it to compare the performance of stage I of our algorithm across a number
of test data sets which represent different stages of crop growth and weather
conditions that the vehicle is likely to encounter. The performance of stage II
across these data sets is assessed using the maximum realisable ROC (MRROC)
curve. It is also possible to use the slope of the ROC curve to select a value for
the algorithm’s controlling parameter which minimises the Bayes risk associated
with the decision being made; van Trees provides full details.
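
The area under an ROC curve given as a list of points can be estimated with the trapezoidal rule. This is a common choice and an assumption here; the text does not say which numerical rule was used.

```python
def area_under_curve(points):
    """Trapezoidal area under a list of (FPR, TPR) points."""
    pts = sorted(points)
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


perfect = area_under_curve([(0.0, 0.0), (0.0, 1.0), (1.0, 1.0)])
chance = area_under_curve([(0.0, 0.0), (1.0, 1.0)])
```

A perfect classifier gives an area of one, while a classifier on the chance diagonal gives one half.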



3.1 The MRROC Curve 

In the analysis described above, variation of a single decision parameter in a 
classification algorithm leads to the formation of the ROC curve. Each point on 
the curve characterises an instance of the classification algorithm that we call 
a classifier. If a single parameter is undergoing variation, then all of the 
classifiers lie along the ROC curve. This is the case within stage I of our 
algorithm, the adaptive interpolating threshold, which has gain parameter a, as 
defined earlier. 

When an algorithm has more than one parameter, it will generate a 
cloud of classifiers in the ROC space. The convex hull of this cloud is the MRROC 
curve, and the area underneath it reflects the best overall classification 
performance it is possible to obtain from this group of classifiers. We will use 
the area under the MRROC curve to compare the operation of stage II of our 
algorithm, which has two parameters, c and r (the size threshold and clustering 
radius, respectively), on different data sets. The set of classifiers which provide 
the best performance comprises those that lie on the MRROC curve. Unlike 
the normal ROC curve, which is a function of one decision parameter alone, the 
MRROC curve does not allow the algorithm operating point to be set on the basis 
of its slope. 
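As a rough sketch of the construction just described, the upper convex hull of a cloud of (FPR, TPR) points can be found with a monotone-chain scan. The cloud below is hypothetical, the function names are illustrative, and including the trivial classifiers (0, 0) and (1, 1) in the hull is an assumption rather than something stated in the text.

```python
def cross(o, a, b):
    """Z-component of (a - o) x (b - o); positive for a left turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def mrroc(points):
    """Upper convex hull of (FPR, TPR) points, running from (0, 0) to (1, 1)."""
    pts = sorted(set(points) | {(0.0, 0.0), (1.0, 1.0)})
    hull = []
    for p in pts:
        # Drop the last vertex while it lies on or below the chord to p.
        while len(hull) >= 2 and cross(hull[-2], hull[-1], p) >= 0:
            hull.pop()
        hull.append(p)
    return hull


def area_under(hull):
    """Trapezoidal area under a piecewise-linear curve."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(hull, hull[1:]))


# Hypothetical cloud of classifiers produced by a two-parameter sweep.
cloud = [(0.1, 0.5), (0.2, 0.3), (0.3, 0.8), (0.5, 0.6), (0.6, 0.95)]
hull = mrroc(cloud)
hull_area = area_under(hull)
```

Classifiers that fall strictly below the hull, such as (0.2, 0.3) here, are dominated and do not contribute to the area.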



4 Characterisation of the Algorithm 

We will now deal with algorithm characterisation, which is the evaluation of 
algorithmic performance over a range of different data sets. For the purposes 
of performance evaluation, we require image data sets which are representative 
of the application, and also a set of labelled images which represent the “true” 
segmentation of these scenes into the classes of interest to compare with the 
algorithmic output. 



4.1 Ground Truth Image Data 

Four sequences of images captured from the vehicle were used in off-line tests 
of the classification algorithm. An example image from each sequence is given 
in figure 4 (a)-(d). The sequences have been chosen to represent a range of 
typical crop growth stages and imaging conditions, although this range should 
by no means be considered exhaustive. The sequence properties are summarised 
in table 1. The deep shadows seen in figure 4 D are a result of bright sunlight. 



Sequence   # images   Crop age   Weed density   Weather 
A          960        8 weeks    low            cloudy 
B          960        3 weeks    very low       overcast 
C          1380       6 weeks    moderate       overcast 
D          1280       3 weeks    very low       sunny 

Table 1. Properties of the image sequences. 



Haralick asserts that performance characterisation requires a test set 
of statistically independent data. To this end, a subset of each image sequence 
was chosen such that no two images contain overlapping areas of the ground, 
which ensures that no two pixels in the test set represent the same patch of soil 
or plant. For each image in these test sets (a total of 66 images across the four 
sequences), a ground truth labelling was produced by hand segmenting the image 
pixels into four classes: crop, weed, soil and doubt. The ground truth images 
have been produced by hand using standard image editing software, and are 
subject to error, especially at border pixels where different image regions (crop, 
weed or soil) are adjacent. Some of these pixels will be incorrectly classified as 
their adjacent class, whilst some will be of genuinely mixed class. Alexander [2] 
noted such problems with border pixels and proposed that at the border between 
foreground (in our case plant matter) and background (soil), regions of doubt 
should be inserted, and the pixels within these doubt regions should be ignored 
for the purpose of assessing classifiers. All pixels that are on the border of plant 
matter and soil in the ground truth images are assigned to the doubt class and 
ignored in the classification assessments. 

Fig. 4. Examples from the four image sequences A - D. 



4.2 Stage I 

A set of 27 threshold gain levels was chosen and the algorithm applied to the 
test images to generate the TPR,FPR pairs that constitute the ROC curve. The 
area under each of the curves plotted for sequences A - D is given in table 2. 



Data Set   Area under ROC curve   Area under MRROC curve 
A          0.9957                 0.9974 
B          0.9779                 0.9997 
C          0.9846                 0.9996 
D          0.9241                 0.9993 

Table 2. Area underneath ROC curves for algorithm stage I, sequences A-D (left) and 
for the MRROC curves for algorithm stage II (right). 



The performance of stage I on each of the four data sets is reflected by 
the measures of area underneath the ROC curve shown in table 2. These show 
that the algorithm performs best on sequence A, where the plants are large and 
there are few shadows, with sequences C and B following. The lowest overall 
performance is seen on sequence D, caused by the heavy shadows present (figure 
4 D). 

4.3 Stage II 

As noted above, to compare the performance of stage II of the algorithm on our 
data sets, we use MRROC analysis. The two parameters c and r, described 
earlier, are varied systematically (27 samples of each parameter, yielding a 
total of 729 classifiers) to produce a cloud of TPR,FPR pairs in the ROC space. 
The convex hull of these points constitutes the MRROC curve and the area 
underneath the curve is calculated for each data set and used as a metric for 
comparison, with better performance indicated as usual by an area closer to 1. 

Recall that stage II of the algorithm comprises two steps, a size filtering step 
followed by feature clustering on the basis of proximity to the crop grid pattern. 
In the fully automatic algorithm, the input features are provided by stage I, and 
the grid position by the extended Kalman filter crop grid tracker. In our 
first experiment, we removed the dependency on both of these algorithms by 
locating the crop grid by hand, and used the ground truth classified features as 
our input. In his outline of a performance characterisation methodology, Haralick 
states that testing algorithms on perfect input data is not worthwhile; if 
the algorithm’s performance is less than perfect, then a new algorithm should 
be devised. In an ideal world, this would be the case, but the crop/weed 
discrimination problem is difficult; capturing the large variations in size and 
shape of each sort of plant and devising an algorithm to fit such models to image 
data will not be easy, so we currently have to settle for an imperfect algorithm 
that makes mistakes even on perfect data. In this case, testing on perfect input 
data tells us the best performance that the algorithm can be expected to deliver. 

The areas underneath the MRROC curve for each sequence in this experiment 
are given in the right-hand column of table 2, whilst a section of the MRROC 
curve, and the cloud of classifiers in the ROC space, is plotted in figure 5 (where 
we take crop pixels to be positives and weed pixels to be negatives). In table 
2, the performance of the stage II algorithm is seen to be consistent over each 
sequence, and very close to the ideal of 1 in each case. As noted above, to generate 
the curve for each sequence, we ran 729 trials of the algorithm over each of the 4 
sequences, a time-consuming task. To cut down on computational effort for the 
fully automatic algorithms, we selected a single (c, r) pair for each sequence. The 
point selected was that closest to the ideal (0,1) point in ROC space. A more 
principled selection of operating parameters might be possible if the values and 
costs of correct and incorrect decisions were known. For example, if the farmer 
wishes to remove all weeds and is willing to risk some crop in this process, the 
value of true negatives (correctly classified weed) would be high, and the cost 
of a false positive (weed classified as crop) would be higher than the cost of a 
false negative (crop classified as weed). If crop fertilisation was a priority, the 
value of a true positive (correctly identified crop) would be high, and the cost of 
a false negative would be higher than the cost of a false positive. The values of 
c and r chosen, together with their corresponding TPR and FPR, are given for 
each sequence in table 3. 

Fig. 5. The MRROC curve for ground truth plant matter segmentations of sequence C. 
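The selection rule used here, choosing the classifier closest to the ideal (0, 1) point in ROC space, can be sketched as follows. The candidate list is hypothetical and the dictionary keys are illustrative names, not the authors' data structures.

```python
import math


def closest_to_ideal(classifiers):
    """Return the classifier whose (FPR, TPR) point is nearest to (0, 1)."""
    return min(classifiers,
               key=lambda k: math.hypot(k["fpr"], 1.0 - k["tpr"]))


# Hypothetical sweep results: size threshold c, radius r and the ratios.
candidates = [
    {"c": 30, "r": 100, "tpr": 0.994, "fpr": 0.000},
    {"c": 80, "r": 100, "tpr": 0.999, "fpr": 0.020},
    {"c": 100, "r": 450, "tpr": 0.900, "fpr": 0.001},
]
best = closest_to_ideal(candidates)
```

A cost-weighted distance could replace the Euclidean one if the values and costs of the different decisions were known, as the text suggests.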









           Parameter Setting                       Automatic Tracking 
Sequence   c (pixels)   r (mm)   TPR      FPR      TPR      FPR 
A          100          450      0.9950   0.0      0.9939   0.1564 
B          30           100      0.9940   0.0      0.9982   0.0 
C          80           100      0.9970   0.0094   0.9981   0.0307 
D          30           100      0.9975   0.0638   0.9993   0.2017 

Table 3. Operating points for the size filtering and clustering algorithms, and their 
corresponding TPR and FPR chosen in the parameter setting experiment (left), 
together with the TPR and FPR realised under automatic tracking (right, and section 
4.4). 



4.4 Segmentation of Ground Truth Plant Images 

Before combining stages I and II of the algorithm and analysing overall perfor- 
mance, we test stage II on the ground truth plant matter images under automatic 
tracking by our Kalman filter algorithm. We perform this experiment in order 
to assess stage II of the algorithm in a way that is as far as possible 
independent of the image thresholding algorithm of stage I. The test is not 
entirely independent of the image processing errors, because they have an effect 
on the tracker’s estimate of the crop grid position that is used in the feature 
clustering algorithm, but it does allow us to compare the true positive and false 
positive ratios for crop pixels directly with those found in the parameter selection 
experiments. 

We use the Kalman filter’s estimate of the crop grid position in conjunction 
with the size filtering and feature clustering algorithm of algorithm stage II. After 
this processing, we have two sets of classified pixels for each image sequence. The 
first set is C, the ground truth plant matter pixels that have been classified as 
crop. The second set is W, the ground truth plant matter pixels that have been 
classified as weed. Given the two sets C and W, we can produce true positive 
(ground truth crop pixels that are classified C) and false positive (ground truth 
weed pixels classified as C) ratios for the automatic segmentation. These ratios 
are given in table 3 in the column marked ‘automatic tracking’. 

Before we compare the ratios from the tracking experiment with those from 
the parameter setting experiment, we reiterate the main differences between 
the two experiments. In the tracking experiment, the association region, within 
which all features are classified as crop, includes the uncertainty on the grid 
position, so will be larger than the corresponding region in the parameter setting 
experiment where the grid position was assumed to be known perfectly. We 
might expect that, as the association region expands, more image features will 
fall within it, so both TPR and FPR are likely to rise. The second difference is 
that the grid position in the tracking experiment is determined automatically by 
tracking the features derived from image processing, whilst in the parameter 
setting experiment, the grid was placed by hand, and will be unaffected by any 
image processing errors. 

If we now compare the tracking and parameter setting figures in table 3, 
we can see how these two experimental differences manifest themselves for each 
sequence: 

Sequence A: The TPR drops and the FPR rises when the grid is tracked 
automatically. This sequence is the most difficult to track, because many crop 
plant features merge together so that feature centroids do not represent plant 
locations. Poor tracking is almost certainly the cause of the increased errors. 
Sequence B: The TPR rises for the automatic tracker, where the association 
regions will be larger than in the parameter setting experiment owing to the 
increased uncertainty on plant position. The FPR is unaffected; this is a 
result of the low weed density in sequence B. 

Sequence C: Both TPR and FPR increase under automatic tracking. This 
will be caused by the larger association region as it incorporates plant posi- 
tion uncertainty from the tracker. 

Sequence D: As with sequence C, both TPR and FPR increase. Owing to 
the strong shadows present in this sequence, automatic tracking is difficult, 
so the uncertainty on individual plant position will be large; this is reflected 
in the dramatic rise in FPR. 

The figures in table 3 show that the combination of size filtering and feature 
merging is very effective for classifying crop features, with true positive ratios in 
excess of 0.99 for every sequence. The algorithm is less effective at weed pixel 
classification when tracking is difficult, as in sequences A and D, where the FPR 
rises to 15% and 20% respectively. This is not surprising because the success 
of the feature clustering algorithm hinges on the crop grid tracker providing 
good estimates of the crop position. However, when the tracking is easier, as in 
sequences B and C, the FPRs are much lower, 0.0% and 3.07% respectively. 
4.5 Combining Stages I and II 

The second segmentation experiment relies wholly on the thresholding and 
chain-coding algorithms and tests the full automatic segmentation algorithm that 
combines stages I and II. In the previous experiment, we knew that all the features 
presented for size filtering and clustering were true plant matter. In this 
experiment, some soil pixels will be misclassified as plant matter (and labelled C or 
W), and some plant matter pixels (crop or weed) will be labelled S. A suitable 
value for the threshold gain a for each sequence was determined from the slope 
of the ROC curves generated for each sequence, using an empirical estimate 
of the Bayes costs and values and prior probabilities computed from the 
test data. 

Each of the tables 4 - 7 presents the percentage of the ground truth crop, 
weed and soil pixels classified as C, W and S, together with the total number of 
ground truth pixels in each class from the ground truth images of sequences A 
- D. The numbers of pixels bordering ground truth crop and weed features are 
also given as an indication of the number of doubt pixels that have been ignored 
in the classification totals. Each image is composed of 384 x 288 pixels, although 
only pixels in the region of the image (approximately 65%) that will pass underneath 
the autonomous vehicle’s treatment system (a bar of spray nozzles that runs along 
the front axis of the vehicle) are classified. 

Perusal of the figures in tables 4 - 7 prompts a number of observations: 

1. In every sequence, in excess of 98% of the soil pixels are correctly classified 
as S. 

2. In each sequence, more crop pixels are misclassified as S than misclassified 
as W. 

3. In each sequence, more weed pixels are misclassified as S than misclassified 
as C. 

4. In sequences A and C, a greater percentage of crop pixels are correctly 
classified C than the percentage of weed pixels that are correctly classified 
as W. 

5. In sequences B and D, a greater percentage of weed pixels are correctly 
classified W than the percentage of crop pixels that are classified C. 

6. The number of doubt pixels that border ground truth weed features 
outnumbers the total number of ground truth weed pixels in every test sequence. 

7. The total number of ground truth crop pixels outnumbers the doubt pixels 
that border the crop features in every test sequence. 

Observations 1, 2 and 3 directly reflect the performance of the adaptive inter- 
polated grey-level thresholding algorithm, which misclassifies a large percentage 
of the plant matter pixels as soil. This will obviously be the most common 
misclassification, because plant matter is most often seen against a background 
of soil rather than other plant matter. The observations do, however, highlight 
the fact that the plant matter/soil discrimination problem requires more atten- 
tion if image segmentation is to be improved. 
                 Classified as                 Number of 
Ground truth   C (%)   W (%)   S (%)      pixels   border pixels 
Crop           95.11    1.28    3.61     331,222          52,688 
Weed           10.50   51.88   37.62         505           1,373 
Soil            0.33    0.03   99.64     905,200               - 

Table 4. Sequence A segmentation results, percentages of true numbers of crop, weed 
and soil pixels classified as C, W or S, and the number of pixels that border crop and 
weed features. There are 16 ground truth images for sequence A. 



                 Classified as                 Number of 
Ground truth   C (%)   W (%)   S (%)      pixels   border pixels 
Crop           78.90    3.72   17.38      53,514          19,615 
Weed            0.0    81.8    18.2          934           3,455 
Soil            0.01    0.04   99.95   1,152,254               - 

Table 5. Sequence B segmentation results, percentages of true numbers of crop, weed 
and soil pixels classified as C, W or S, and the number of pixels that border crop and 
weed features. There are 17 ground truth images for sequence B. 



                 Classified as                 Number of 
Ground truth   C (%)   W (%)   S (%)      pixels   border pixels 
Crop           81.72    4.51   13.76     141,075          19,615 
Weed            3.93   56.90   39.17      17,160          18,544 
Soil            0.06    0.24   99.7    1,195,308               - 

Table 6. Run C segmentation results, percentages of true numbers of crop, weed and 
soil pixels classified as C, W or S, and the number of pixels that border crop and weed 
features. There are 17 ground truth images for sequence C. 



                 Classified as                 Number of 
Ground truth   C (%)   W (%)   S (%)      pixels   border pixels 
Crop           55.00    4.11   40.89      41,411          13,171 
Weed            6.31   73.52   20.17       1,046           2,202 
Soil            0.06    1.13   98.81   1,003,418               - 

Table 7. Run D segmentation results, percentages of true numbers of crop, weed and 
soil pixels classified as C, W or S, and the number of pixels that border crop and weed 
features. There are 16 ground truth images for sequence D. 
Observations 4 and 5 suggest that the larger plants seen in image sequences 
A and C are more easily identified than the smaller plants in sequences B and D. 
The reasons for this are unclear, but may be related to changes in the infra-red 
reflectance of the crop plants as they age. 

Observations 6 and 7 show that the weed features, which are dominated by 
border pixels, are typically smaller than the crop features. This has already been 
illustrated in an earlier figure and forms the basis of the size threshold algorithm. 

If we ignore the crop and weed ground truth pixels that the segmentation 
algorithm labels S, we can construct true positive and false positive ratios for 
the crop and weed pixels that have been classified as plant matter (either C or 
W). These figures are given for each sequence in table 8 and show that those 
pixels which are identified as plant matter are separated into the crop and weed 
classes with some success. This allows us to conjecture that if plant matter/soil 
discrimination were more reliable then figures similar to those in table 8 might 
be obtained. 



Sequence   TPR      FPR 
A          0.9639   0.1683 
B          0.9550   0.0 
C          0.9477   0.0650 
D          0.9305   0.0790 

Table 8. TPR and FPR for the correctly identified plant matter pixels in sequences 
A-D. 
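The ratios in this table follow from the per-sequence classification percentages once the pixels labelled S are set aside. A minimal sketch, using the sequence B and sequence D rows of the segmentation tables as input (the function name is illustrative), is:

```python
def plant_only_rates(crop_as_c, crop_as_w, weed_as_c, weed_as_w):
    """TPR and FPR for crop, ignoring pixels that were labelled soil.

    The arguments are the percentages of ground truth crop (or weed)
    pixels that were classified as C or W.
    """
    tpr = crop_as_c / (crop_as_c + crop_as_w)
    fpr = weed_as_c / (weed_as_c + weed_as_w)
    return tpr, fpr


# Sequence B: crop -> C 78.90, crop -> W 3.72, weed -> C 0.0, weed -> W 81.8
tpr_b, fpr_b = plant_only_rates(78.90, 3.72, 0.0, 81.8)

# Sequence D: crop -> C 55.00, crop -> W 4.11, weed -> C 6.31, weed -> W 73.52
tpr_d, fpr_d = plant_only_rates(55.00, 4.11, 6.31, 73.52)
```

For these two sequences the results agree with the tabulated ratios to four decimal places. Percentages can be used directly because each ratio divides two figures that share the same ground truth pixel count.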



5 Conclusions 

We have used a novel two stage algorithm developed for a horticultural applica- 
tion to illustrate that breaking an algorithm down into its constituent compo- 
nents and testing these individually can provide a better understanding of overall 
behaviour. Analysis of the test results allows us to conclude that the majority of 
the errors in the system are propagated forward from stage I of the algorithm. 
It was seen that stage II performs effectively on the data that is correctly propagated 
from stage I, so algorithm development should focus on improving the plant 
matter/soil segmentation. Empirical discrepancy analysis based on ROC curves and 
type I and type II statistical errors was used for the individual binary classifiers, 
and overall tri-partite classification figures given for the full algorithm. 
Abstract. Using forensic fingerprint identification as a testbed, a statistical 
framework for analyzing system performance is presented. Each 
set of fingerprint features is represented by a collection of binary codes. 
The matching process is equated to measuring the Hamming distances 
between feature sets. After performing matching experiments on a small 
data base, the number of independent degrees of freedom intrinsic to 
the fingerprint population is estimated. Using this information, a set of 
independent Bernoulli trials is used to predict the success of the system 
with respect to a particular dataset. 



1 Introduction 

Given an image of a particular target, computer vision recognition systems 
search a large database in order to find a second image of the target. 
The approach usually takes several steps. The initial image is characterized by 
a set of features forming a target template. Feature extraction is performed on 
all candidate images in the database. Each set of candidate features is matched 
with the template set (this may require some form of registration). A similarity 
function is used to determine the merit of each match. The candidates with the 
highest match scores are reported to the user. It is important for developers of 
such systems to have answers to the following questions: 

— As the database grows what will happen to the reliability of the system? 

— What is the optimal performance that can be expected for a given datum? 

— Is there room for improvement in the system and if so where should future 
research resources be allocated? 

In this work a statistical framework for analyzing these questions is presented 
in the context of forensic fingerprint identification. Each set of image features 
is represented by a collection of binary codes. The matching process is viewed 
as a mechanism which measures the Hamming distance between various codes 
generated by the template and those generated by a candidate. The number of 
independent degrees of freedom intrinsic to the population is then measured by 
performing experiments on a small representative data base. Using this 
information, the ranking for the true match of a particular template can then be 
modeled as a series of independent Bernoulli trials. Questions regarding system 
reliability, scalability and maturity can then be addressed. 

In forensic identification the target image is known as a latent print, such 
as one found at a crime scene. The candidate images in the database are known as 
tenprints and are taken under controlled conditions. Some tenprint databases can 
have hundreds of millions of entries. Many fingerprint systems 
use minutiae as their image features. These are points where ridges terminate or 
bifurcate. They are characterized by their 2D location and an angular measure 
corresponding to the orientation of the surrounding ridge flow. Stretching by up 
to 30 percent can take place and there may be false and missing minutiae. The 
latent is usually just a partial print which can have as few as 5 or 6 minutiae 
(the average tenprint has over 80). The position of the latent with respect to the 
tenprint coordinate system is usually unknown. Mechanisms based on approaches 
such as graph matching and matched filtering are often used to perform 
the matching between sets of minutiae. Each tenprint is ranked based on the 
score received during the matching process. If the search is successful, the true 
mate will receive a rank at or near the top of the list. 

By representing minutia structure as sets of binary codes and performing 
matching experiments on a local database of 300 tenprints the following infor- 
mation will be determined: 

— The number of degrees of freedom found in a particular fingerprint code. 

— The expected ranking of the true mate for a given latent with respect to a 
700,000 print database. 

— The level of performance of a particular matching algorithm based on the 
analysis of 86 latent prints. 



1.1 Previous Work 

Methods for computing the probability of encountering two identical sets of 
features from unrelated fingerprints have been proposed in earlier works. 
However, these approaches assume that only local features are correlated. 

Like fingerprint matching, iris matching has become a viable tool for online 
identification. In the work by Daugman, a single binary code is generated for each 
iris. Matching can then be accomplished by computing the Hamming distance 
between codes. Statistical significance is evaluated by measuring the number of 
independent degrees of freedom that these codes possess. The matching process 
can then be equated to a series of independent Bernoulli trials which are modeled 
probabilistically. In this paper, Daugman’s approach is extended by generating 
multiple codes for a single set of features so that the latent and a tenprint need 
not be registered in advance. 

2 Measuring Average Information Content 

The purpose of the first experiment is to determine the information contained 
within a 100x100 pixel region of a fingerprint. Ostenburg’s grid system is 
used to generate a binary code for the region. In this experiment only minutia 
position is considered. One conclusion that is drawn from this experiment is that 
minutiae are correlated with one another. 

By representing local minutia structure as a binary code, the information 
contained in a fingerprint can be quantitatively assessed. A similarity function 
between two sets of minutiae is developed based on the Hamming distance bet- 
ween their binary codes. By extracting a large number of codes and computing 
the Hamming distances between all pairs of unrelated codes, the relative fre- 
quencies of the similarity function can be measured. The parameters for the 
appropriate probability distribution function of the similarity function can then 
be estimated. These parameters are used to determine the number of statistical 
degrees of freedom that are intrinsic to the fingerprint population. 

2.1 Minutia Structure to Binary Codes 

Given a particular minutia A, a 10 by 10 grid of squares is centered on minutia 
A and is rotated so as to be aligned with minutia A’s orientation. Each square 
is 10 pixels by 10 pixels in dimension. A binary code is formed by assigning a 
single bit to each square. If a square contains a minutia other than minutia A, 
then its bit is set to 1. In this way a binary code is used to represent the local 
minutia structure around minutia A. See figure 1 for an example of the code 
extraction process. 
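The code extraction can be sketched as below. The coordinate conventions are assumptions made for illustration only: orientation is given in radians, the grid is rotated so that the orientation of minutia A lies along the first grid axis, and bits are stored in row-major order. The function name and arguments are hypothetical.

```python
import math


def minutia_code(center, others, cells=10, cell_size=10):
    """Return a list of cells*cells bits describing minutiae around `center`.

    center: (x, y, theta) of minutia A, with theta in radians.
    others: iterable of (x, y) positions of the remaining minutiae.
    """
    cx, cy, theta = center
    half = cells * cell_size / 2.0
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    bits = [0] * (cells * cells)
    for x, y in others:
        dx, dy = x - cx, y - cy
        # Rotate by -theta so that A's orientation lies along the +u axis.
        u = dx * cos_t + dy * sin_t
        v = -dx * sin_t + dy * cos_t
        col = math.floor((u + half) / cell_size)
        row = math.floor((v + half) / cell_size)
        # Minutiae outside the 100 x 100 pixel window are ignored.
        if 0 <= col < cells and 0 <= row < cells:
            bits[row * cells + col] = 1
    return bits


# Hypothetical example: one neighbour inside the window, one far outside.
code = minutia_code((200.0, 200.0, 0.0), [(227.0, 203.0), (400.0, 400.0)])
```

Because the grid is centred on and aligned with minutia A, the same local structure yields the same code wherever the print lies in the image and however it is rotated.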

Using a database of 300 tenprints, 10 minutiae were selected at random from 
each print. A code was generated for each minutia, resulting in 3000 codes. The 
relative frequency of each bit can be seen in figures 2 and 3. As one would expect, 
the 8 bits around the center of the grid have a relatively low frequency. These bits 
are excluded, leaving a 92 bit code. From these measurements it was estimated 
that the probability p that a bit is turned on was uniformly distributed and that 

$$p = 0.05381. \tag{1}$$

2.2 Hamming Distance 

Given two 92 bit codes X and Y, the similarity of the two codes can be measured 
based on the Hamming distance. The Hamming distance is defined as the average 
value of the exclusive or between each pair of bits in the two codes. A similarity 
function S which is a maximum for identical codes is defined as: 

$$S = \frac{1}{92} \sum_{k=1}^{92} H[k] \tag{2}$$

where 

$$H[k] = \begin{cases} 1 & \text{if } X[k] = Y[k] \\ 0 & \text{otherwise} \end{cases} \tag{3}$$
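Read as one minus the average exclusive-or of the two codes, the similarity function can be sketched as follows. This reading is an assumption drawn from the surrounding text (S is largest for identical codes, and each term has expected value p^2 + (1 - p)^2); the function name is illustrative.

```python
def similarity(x, y):
    """Fraction of bit positions at which two equal-length codes agree."""
    if len(x) != len(y):
        raise ValueError("codes must have the same length")
    agree = sum(1 for a, b in zip(x, y) if a == b)
    return agree / len(x)


# Hypothetical 92 bit codes that differ in a single position.
code_x = [0] * 92
code_y = [0] * 91 + [1]
s_same = similarity(code_x, code_x)
s_diff = similarity(code_x, code_y)
```

With this definition the Hamming distance is simply 1 - S.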
Fig. 1. A minutia is selected and a grid is centered on and oriented with this minutia. 
The green dots are identified minutiae. The blue marks indicate squares where local 
minutiae have been found. 

Fig. 2. The bit number versus the probability of the bit being set to 1. As expected 
the bits near the center of the grid have a lower probability of activation. 

Fig. 3. A grid depicting the probability of a bit being set to 1. The bits near the center 
of the grid have a lower probability of activation. 



Each term of S can be viewed as a Bernoulli trial. For all k, the expected 
value of H[k] is equal to a where 

$$a = p^2 + (1 - p)^2. \tag{4}$$

It follows that 

$$\bar{S} = E(S) = a = 0.89817. \tag{5}$$

If the initial assumption is made that all the bits in the code are independent, 
then it would be expected that 

$$\sigma_S = \sqrt{E[(S - \bar{S})^2]} = \sqrt{\frac{a(1 - a)}{92}} = 0.031529. \tag{6}$$

The complete independence assumption implies that there are 92 degrees 
of freedom in the minutia structure. The following tests will show that this 
assumption is not valid. 

Using the 3000 codes, the similarity function S was computed for every pair 
of unrelated codes. The observed mean and standard deviation of S were: 

$$\bar{S}_{obs} = 0.898185 \tag{7}$$

and 

$$\sigma_{obs} = 0.033366 \tag{8}$$

The observed mean is very close to the predicted value (equation 5). However, 
the observed standard deviation of S is larger than expected. It is therefore 
concluded that the true number of degrees of freedom N, which can be found 
by solving the formula: 

$$\sigma_{obs} = \sqrt{\frac{a(1 - a)}{N}} \tag{9}$$

is N ≈ 82. 

In other words, taking a measurement of S between two unrelated codes, is 
equivalent to counting the number of times a weighted coin comes up heads out 
of N tosses such that the probability of getting heads on an individual toss is 
equal to a. 
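The arithmetic behind these figures is short enough to check directly. The sketch below uses only the values quoted in the text (p, the 92 bit code length and the observed standard deviation); the variable names are illustrative.

```python
import math

p = 0.05381            # probability that a bit is set
bits = 92              # code length
sigma_obs = 0.033366   # observed standard deviation of S

# Probability that two independent bits agree: both set or both clear.
a = p ** 2 + (1 - p) ** 2

# Standard deviation of S if all 92 bits were independent.
sigma_indep = math.sqrt(a * (1 - a) / bits)

# Effective number of independent trials implied by the observed spread.
n_eff = a * (1 - a) / sigma_obs ** 2
```

The observed spread is wider than the independent-bit prediction, which is why the effective number of degrees of freedom comes out below 92.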

Given the estimates of a and N, the probability of observing a value of 
S = m/92 can be computed. The first step is to determine m_1 such that 

$$\frac{m_1}{N} = \frac{m}{92}. \tag{10}$$

The probability of observing S is calculated by: 

$$P\left(S = \frac{m}{92}\right) = \frac{N}{92} \, \frac{N!}{m_1! \, (N - m_1)!} \, a^{m_1} (1 - a)^{N - m_1}. \tag{11}$$



This equation is based on the standard binomial distribution where N is the 
number of trials, m_1 is the number of positive results and a is the probability 
of a positive result on a given trial. The scaling factor N/92 is used to compensate 
for the fact that there are 92 possible results as opposed to just N. Linear 
interpolation is used since in general m_1 is not an integer. Figure 4 shows the 
measured frequencies of S along with P(S) computed in equation 11. It turns 
out that the probability of finding two identical codes is 1:6677 as opposed to 
1:19,544, which was computed under the assumption of complete independence. 
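The two odds quoted for identical codes can be approximated with a few lines of arithmetic. The sketch assumes that they correspond to every one of N (or 92) independent trials succeeding, that is a^N and a^92, without the scaling factor; this is an inference from the numbers rather than something the text states.

```python
p = 0.05381
a = p ** 2 + (1 - p) ** 2   # probability that a single trial agrees

n_dependent = 82     # estimated degrees of freedom
n_independent = 92   # degrees of freedom under complete independence

# Odds (one in ...) that every trial agrees, i.e. the codes are identical.
odds_dependent = 1.0 / a ** n_dependent
odds_independent = 1.0 / a ** n_independent
```

Both values land close to the quoted 1:6677 and 1:19,544, and they show how losing ten degrees of freedom makes a chance match roughly three times more likely.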



3 Statistical Significance of a Particular Latent 

When searching for the true mate of a particular latent, each tenprint in the 
database is ranked based on a score given during the matching process. If the 
search is successful, the true mate will be ranked at or near the top of the list. 

In this section an experiment based on an “ideal” matching mechanism will 
be performed in order to evaluate a particular latent and its true mate. Using 
measurements taken from a 300 print database, the odds that a false print will 
out-rank (i.e. get a higher score than) the true mate will be determined. A 
prediction of the ranking that the true mate would receive from a 700,000 print 
search will be made. This will be compared to the results achieved by a real 
search performed by an in-house matching algorithm referred to here as the 
“MATCHER”. 

The ideal matching mechanism is modeled as a form of template matching. 
In this process a set of transformations between the latent and the tenprint 
coordinate system are generated. Each transformation, which is composed of a 
translation and a rotation, is applied to the latent print. Once the latent has been 
transformed, it is determined whether or not each latent minutia can be matched 
with a tenprint minutia. This can be viewed as a mechanism for generating a 
binary code, where bit i is set to 1 iff minutia i can be matched to a tenprint 
minutia. A merit function M for the transformation is defined as the sum of the 
bit values divided by the number of bits in the code. The score assigned to the 
tenprint is set to the merit of the transformation with the highest merit score. 

Fig. 4. The dots show the measured frequencies of the similarity function S (defined 
in section 2.2). The solid line is the estimated probability density function shown in 
equation 11. As can be seen there is almost perfect agreement. 

A transformation is generated by aligning a single latent minutia with a single 
tenprint minutia. The alignment is based on both position and orientation (angle 
of the dominant ridge flow near the minutia) of the minutia. The latent minutia 
used to construct the transformation is not used when the merit of the resulting 
binary code is computed. 

Each minutia is defined by an $(x, y, \omega)$ coordinate where $(x, y)$ represents
position and $\omega$ represents orientation. Let $(x', y', \omega')$ represent the transformed
coordinates of minutia $i$. The $i$th bit for the transformation will be set to 1 iff
there exists a tenprint minutia with coordinates $(x, y, \omega)$ such that:

$$\sqrt{(x' - x)^2 + (y' - y)^2} < \Delta_s \qquad (12)$$

and

$$|\omega' - \omega| < \Delta_a \qquad (13)$$

where $\Delta_s$ is a spatial threshold and $\Delta_a$ is an angular threshold.
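
The template-matching model described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: minutiae are `(x, y, omega)` tuples with angles in radians, the angular difference is wrapped to the shortest arc, and the function names (`rigid_from_pair`, `merit`, `score`) are hypothetical, not taken from the paper. The checks for rotation limits and convex-hull containment mentioned later in the text are omitted.

```python
import math

def rigid_from_pair(latent_m, tenprint_m):
    """Rotation and translation that map one latent minutia onto one tenprint minutia."""
    lx, ly, lw = latent_m
    tx, ty, tw = tenprint_m
    theta = tw - lw                      # align the orientations
    c, s = math.cos(theta), math.sin(theta)
    dx = tx - (c * lx - s * ly)          # then align the positions
    dy = ty - (s * lx + c * ly)
    return theta, dx, dy

def apply_rigid(transform, minutia):
    theta, dx, dy = transform
    x, y, w = minutia
    c, s = math.cos(theta), math.sin(theta)
    return (c * x - s * y + dx, s * x + c * y + dy, w + theta)

def angle_diff(a, b):
    """Smallest absolute difference between two angles (assumed wrap-around)."""
    d = (a - b) % (2 * math.pi)
    return min(d, 2 * math.pi - d)

def merit(latent, tenprint, seed_index, transform, delta_s, delta_a):
    """Fraction of set bits; the seed minutia is excluded from the code."""
    bits = []
    for i, m in enumerate(latent):
        if i == seed_index:
            continue
        x1, y1, w1 = apply_rigid(transform, m)
        hit = any(math.hypot(x1 - x, y1 - y) < delta_s      # equation 12
                  and angle_diff(w1, w) < delta_a           # equation 13
                  for (x, y, w) in tenprint)
        bits.append(1 if hit else 0)
    return sum(bits) / len(bits)

def score(latent, tenprint, delta_s, delta_a):
    """Best merit over all transformations built from single minutia pairs."""
    best = 0.0
    for i, lm in enumerate(latent):
        for tm in tenprint:
            t = rigid_from_pair(lm, tm)
            best = max(best, merit(latent, tenprint, i, t, delta_s, delta_a))
    return best
```

With 14 latent minutiae the code has 13 bits, which is why the true mate's score in the example below is expressed out of 13.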

By modeling the matching process in this way, it can be argued that a somewhat
optimistic prediction will be generated. This is because it is assumed that
a rigid transformation can align a latent with its true mate in spite of the fact
that stretching of up to 30 percent can occur. If minutia descriptions are the
only features used to represent the fingerprint, a true matching algorithm would
be forced to use weaker assignment criteria. However, variation due to stretching
can be reduced by considering additional fingerprint features. Ridge topology is
invariant with respect to stretching. The number of ridges that cross a straight 
line connecting any two points on a fingerprint can be used as a normalized 
distance measure. Fiducial points such as cores and deltas (singularities in the 
ridge flow field) can be identified and used to estimate global transformations 
which compensate for many stretching effects. For these reasons it is argued that 
rigid template matching is a reasonable model for the matching process. 

3.1 Probability Distribution of the Merit Function 

We now consider a particular latent print and its true mate as shown in figure
5. The position and orientation of each minutia are represented by a dot and
a small line segment. There are 14 minutiae in the latent print. An examiner 
determined that there are 11 legitimate minutia assignments between the latent 
and its true mate. A triangulation is applied to the matched minutiae to make 
it easier to see the assignments. The merit score for the true mate is 10 out of 
13 possible matches. It is not 11 out of 14 because one pair of minutiae is always 
needed to generate a transformation. 

In order to determine the spatial and angular thresholds, an affine transform
based on a least squares fit between the identified minutia correspondences of
the latent and the true mate was computed. The values $\Delta_s$ and $\Delta_a$ were set
as tightly as possible while still allowing the transformed latent minutiae to be
matched to their true assignments. An affine transform was used in order to
compensate for possible stretching in the print. In this example the thresholds
were set to:

$$\Delta_s = 15 \text{ pixels} \qquad (14)$$

and

$$\Delta_a = 13.29 \text{ degrees} \qquad (15)$$



The next step is to compute a probability distribution function for the merit
function $M$ so that predictions can be made regarding a search on a large
database for the true mate shown in figure 5. It is important to note that the derived
PDF will only be applicable when considering this particular latent print. All
possible transformations between the latent and 300 tenprints were generated. It 
was assumed that the latent print was roughly oriented so that transformations 
requiring too much rotation were rejected. A transformation was also rejected 
if the transformed latent minutiae were not contained within the convex hull of 
the tenprint minutiae. A total of 53,334 transformations were generated. 

The probability $p$ that a transformed latent minutia $i$ would be matched with
a tenprint minutia was observed to be:

$$p = 0.074 \qquad (16)$$



This is the same as saying that the probability of the $i$th bit being set to 1 is
equal to $p$. Each bit in the code can be viewed as a Bernoulli trial. The expected
value of $M$ is equal to $p$. If we assume independence between bits in the code
and since there are only 13 bits (one minutia is always excluded from the code
since it is used to create the transformation), the standard deviation of $M$ would
be expected to be:

$$\sigma = \sqrt{\frac{p(1 - p)}{13}} \approx 0.0726 \qquad (17)$$

However, the measured standard deviation for $M$ was found to be

$$\sigma_{obs} = 0.075057 \qquad (18)$$

Since the observed standard deviation is higher than predicted, it is concluded
that the latent minutiae are not completely independent. The number of
statistical degrees of freedom $N$ is calculated by solving:

$$\sigma_{obs} = \sqrt{\frac{p(1 - p)}{N}} \qquad (19)$$

which gives

$$N = \frac{p(1 - p)}{\sigma_{obs}^2} \approx 12.16 \qquad (20)$$

Fig. 5. This figure shows a latent and its true mate. The position and orientation of
each minutia is represented by a dot and a small line segment. There are 14 minutiae
in the latent print. An examiner determined that there are 11 legitimate minutia
assignments. A triangulation is applied to the matched minutiae to make it easier to see
the assignments. Note that there are 3 unassigned latent minutiae.



The probability distribution for $M = m/13$ can now be formulated. Let

$$\frac{m_1}{N} = \frac{m}{13} \qquad (21)$$

then

$$P\left(M = \frac{m}{13}\right) = \frac{N}{13}\,\frac{N!}{m_1!\,(N - m_1)!}\,p^{m_1}(1 - p)^{N - m_1} \qquad (22)$$

Like equation 11, this equation is composed of a scaling factor $N/13$ and a binomial
function for $N$ trials and $m_1$ successes with a probability $p$ of success on an
individual trial.

The measured frequencies of $M$ for various values of $m$ along with the
predicted probabilities of $M$ derived from equation 22 are shown in table 1.
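
As a rough numerical check of the reasoning above, the following Python sketch estimates $N$ from the observed standard deviation and evaluates a scaled binomial for each number of set bits $m$. One assumption should be stressed: because $N$ and $m_1$ are not integers, the sketch extends the factorials with the gamma function instead of using the linear interpolation mentioned earlier in the text, so its values only approximate the predicted column of the table. The function names are hypothetical.

```python
import math

def degrees_of_freedom(p, sigma_obs):
    """Solve sigma_obs = sqrt(p (1 - p) / N) for N."""
    return p * (1.0 - p) / sigma_obs ** 2

def generalized_binomial(n, k, p):
    """Binomial term with real-valued n and k (gamma-function extension, an assumption)."""
    log_coeff = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return math.exp(log_coeff + k * math.log(p) + (n - k) * math.log1p(-p))

def merit_probability(m, bits, n, p):
    """Scaled binomial for M = m / bits, with m1 / N = m / bits."""
    m1 = n * m / bits
    return (n / bits) * generalized_binomial(n, m1, p)

p = 0.074
n = degrees_of_freedom(p, 0.075057)
for m in range(14):
    print(m, round(merit_probability(m, 13, n, p), 6))
```

For $m = 0$ no interpolation is involved, and the sketch gives a value close to the 0.3665 listed in the table.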



Table 1. The first column shows the number of set bits in the code. The second column
shows the measured frequency of observing such a code based on tests made on a 300
print database. The third column shows the predicted frequencies based on equation
22. As can be seen there is a reasonable agreement between the measured and predicted
frequencies.

Number     Measured     Predicted
of hits    frequency    frequency
   0       0.401451     0.366531
   1       0.345802     0.353033
   2       0.182266     0.185172
   3       0.056662     0.067505
   4       0.011269     0.017855
   5       0.002231     0.003445
   6       0.000281     0.000487
   7       0.000037     0.000051
   8       0.000000     0.000004
   9       0.000000     0.000000
  10       0.000000     0.000000
  11       0.000000     0.000000
  12       0.000000     0.000000
  13       0.000000     0.000000


As previously stated, the merit that could be attributed to the true mate
would be 10/13. Using equation 22, the probability that a code generated by a false
print will do as well as or better than the true mate is computed to be one in
116,041,312. There were 300 tenprints used to generate 53,334 transformations,
which means that there were approximately 177 transformations per print. Thus
one in 652,724 prints could be expected to outrank the true mate. Given
a database with 700,000 tenprints, one false print could be expected to have a
higher rank than the true mate. This results in an expected ranking of 2 for
the true mate. In comparison, the matching algorithm MATCHER attempted
to locate the true mate of this example from a real 700,000 print database. The
true mate was given a ranking of 3.
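
The arithmetic behind this prediction can be retraced with a short Python sketch. All numbers come from the text; the variable names are hypothetical, and rounding the expected number of false prints to the nearest integer is an assumption about how the ranking of 2 was obtained.

```python
odds_per_transformation = 116_041_312   # one false code in this many transformations
transformations = 53_334
sample_prints = 300
database_size = 700_000

# Average number of transformations generated per tenprint (about 177.8).
per_print = transformations / sample_prints

# Number of prints searched, on average, before one outranks the true mate.
prints_per_false_hit = odds_per_transformation / per_print

# Expected number of false prints ranked above the true mate in the full search.
expected_false = database_size / prints_per_false_hit
expected_rank = 1 + round(expected_false)

print(round(prints_per_false_hit), round(expected_false, 2), expected_rank)
```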

4 Evaluation of the MATCHER Performance 

In order to evaluate the general level of performance of the matching algorithm
MATCHER, the previous analysis was applied to a set of 86 latent prints. These
prints have been identified as difficult to match. In some cases there were fewer
than 10 minutiae to work with. For each latent, the following steps were taken:

1. The true assignments between the latent and its mate were picked by hand.

2. The thresholds $\Delta_s$ and $\Delta_a$ were automatically set based on the relationship
between the latent and its mate.

3. Using measurements on a 300 print database, the PDF for the merit function
$M$ associated with the latent was determined.

4. The odds of encountering a false print which would outperform the true mate
were computed.

5. The ranking of the true mate with respect to a 700,000 print search was
predicted.

6. A real 700,000 print search for the true mate was performed using the
matching algorithm MATCHER.

In figure 6 the predicted and true rankings for this data set have been sorted
and placed on a graph so as to generate a set of performance curves.

The MATCHER significantly outperformed the predictions on 7 out of 86
latents. However, it did not do as well as expected on 13 out of 86 prints. This
is reflected in the gap between the MATCHER performance curve and the
predicted performance curve. Table 2 shows the MATCHER and predicted top 10
performances. The MATCHER failed to achieve a large number of top 1s;
however, at the top 10 level, the MATCHER is only missing 4 prints.

With respect to conclusions that can be drawn regarding the MATCHER 
performance, there does seem to be some room for improvement. However, the 
predictions were made using the assumption that latents and tenprints can be 
matched using a rigid transformation, yet stretching of up to 30 percent can 
happen and any realistic matcher must compensate for this. For this reason the 
predicted performance can be viewed as somewhat optimistic. 

There were a large number of prints that were deemed unmatchable and this 
agrees with the current state of diminishing returns on algorithmic performance. 

5 Summary and Conclusions 

A statistical framework for evaluating a fingerprint recognition system was
developed. By performing experiments on a local database it was shown that the
features used for matching are not completely independent. It was shown how
the matching process can be modeled as a set of independent Bernoulli trials.
This led to the ability to make predictions regarding specific datasets. By
comparing estimates of optimal performance with that achieved by the matching
algorithm MATCHER, statements regarding the maturity of the system can be
made.

Since fingerprint identification is similar in nature to many computer
vision recognition systems, we believe that this approach is broadly applicable.
Once the matching process is understood, experiments performed on a modest 
database may allow researchers to answer questions regarding system reliability 
and scalability. 
[Plot: Predicted and Measured Performance Curves]

Fig. 6. The graph shows the ordered rankings of the true mates for a 700,000 print
database. The two curves represent the predicted performance and the performance of
the MATCHER matching algorithm. An interpretation of a point in the graph with
horizontal coordinate x and vertical coordinate y is that x true mates had a ranking
worse than or equal to rank y. For example, by noting where the two curves intersect
the horizontal dashed line, it can be determined that 29 true mates were predicted to
have a ranking worse than 220 and that when the MATCHER was run, 36 true mates
were found to have a ranking worse than 220. This means that the MATCHER did
slightly less well than expected.
Table 2. This table shows the number of top rankings achieved by the MATCHER as
compared to the number of predicted top rankings.

           MATCHER    Predicted
Top 1         12          23
Top 2         17          26
Top 3         19          28
Top 4         23          29
Top 5         26          31
Top 6         27          31
Top 7         27          32
Top 8         28          32
Top 9         30          33
Top 10        30          34
Abstract. In recent years, the field of active-contour based image
segmentation has seen the emergence of two competing approaches. The
first and oldest approach represents active contours in an explicit (or
parametric) manner corresponding to the Lagrangian formulation. The
second approach represents active contours in an implicit manner
corresponding to the Eulerian framework. After comparing these two
approaches, we describe several new topological and physical constraints
applied on parametric active contours in order to combine the
advantages of these two contour representations. We introduce three key
algorithms for independently controlling active contour parameterization,
shape and topology. We compare our result to the level-set method and
show similar results with a significant speed-up.



1 Introduction 

Image segmentation based on active contours has achieved considerable success
in the past few years [15]. Deformable models are often used for bridging the gap
between low-level computer vision (feature extraction) and high-level geometric
representation. In their seminal paper [8], Kass et al. chose to use a parametric
contour representation with a semi-implicit integration scheme for discretizing
the law of motion. Several authors have proposed different representations [16]
including the use of finite element models [3], subdivision curves [6] and
analytical models [17]. Implicit active contour representations were introduced in [13]
following [19]. This approach has been developed by several other researchers
including “geodesic snakes” introduced in [2].

The opposition between parametric and implicit contour representation
corresponds to the opposition between Lagrangian and Eulerian frameworks.
Qualifying the efficiency and the implementation issues of these two frameworks is
difficult because of the large number of different algorithms existing in the
literature. On one hand, implicit representations are in general regarded as being
less efficient than parametric contours. This is because the update of an implicit
contour requires the update of at least a narrow band around each contour. On
the other hand, parametric contours cannot in general achieve any automatic
topological changes, although several algorithms have been proposed to overcome
this limitation [11, 14, 10].

This paper includes three distinct contributions corresponding to three
different modeling levels of parametric active contours:

1. Discretization. We propose two algorithms for controlling the relative
vertex spacing and the total number of vertices. On one hand, the vertex spacing
is controlled through the tangential component of the internal force applied
at each vertex. On the other hand, the total number of contour vertices is
periodically updated in order to constrain the distance between vertices.

2. Shape. We introduce an intrinsic internal force expression that does not
depend on contour parameterization. This force regularizes the contour
curvature profile without producing any contour shrinkage.

3. Topology. A new algorithm automatically creates or merges different
connected components of a contour based on the detection of edge intersections.
Our algorithm can handle open and closed contours.

We propose a framework where algorithms for controlling the discretization, 
shape and topology of active contours are completely independent of each other. 
Having algorithmic independence is important for two reasons. First, each mod- 
eling component may be optimized separately leading to computationally more 
efficient algorithms. Second, a large variety of active contour behaviors may be 
obtained by combining different algorithms for each modeling component. 

2 Discretization of active contours 

In the remainder, we consider the deformation over time of a two-dimensional
parametric contour $C(u, t) \in \mathbb{R}^2$ where $u$ designates the contour parameter and
$t$ designates the time. The parameter $u$ belongs to the range $[0, 1]$ with
$C(0, t) = C(1, t)$ if the contour is closed. We formulate the contour deformation with a
Newtonian law of motion:

$$\frac{\partial^2 C}{\partial t^2} = -\gamma \frac{\partial C}{\partial t} + f_{int} + f_{ext} \qquad (1)$$

where $f_{int}$ and $f_{ext}$ correspond respectively to internal and external forces. A
contour may include several connected components, each component being a
closed or open contour.

Temporal and spatial discretizations of $C(u, t)$ are based on finite differences.
Thus, the set of $N^t$ vertices $\{p_i^t\}$, $i = 0 \ldots N^t - 1$, represents the contour $C(u, t)$
at time $t$. The discretization of equation 1 using centered and right differences
for the acceleration and speed term leads to:

$$p_i^{t+\Delta t} = p_i^t + (1 - 2\gamma)(p_i^t - p_i^{t-\Delta t}) + \alpha_i (f_{int})_i + \beta_i (f_{ext})_i \qquad (2)$$

In order to simplify the notation, we will write $p_i$ instead of $p_i^t$ for the vertex
position at time $t$. At each vertex $p_i$, we define a local tangent vector $\mathbf{t}_i$, normal
vector $\mathbf{n}_i$, metric parameter $\epsilon_i$ and curvature $k_i$. We propose to define the tangent
vector at $p_i$ as the direction of the line joining its two neighbors:
$\mathbf{t}_i = (p_{i+1} - p_{i-1})/(2 r_i)$ where $r_i = \|p_{i+1} - p_{i-1}\|/2$ is the half distance between the two
neighbors of $p_i$. The normal vector $\mathbf{n}_i$ is defined as the vector directly orthogonal
to $\mathbf{t}_i$: $\mathbf{n}_i = \mathbf{t}_i^{\perp}$ with $(x, y)^{\perp} = (-y, x)$. The curvature $k_i$ is naturally defined as the
curvature of the circle circumscribed to the triangle $(p_{i-1}, p_i, p_{i+1})$. If we write $\phi_i$ for
the oriented angle between segments $[p_{i-1}, p_i]$ and $[p_i, p_{i+1}]$, then the curvature
is given by $k_i = \sin(\phi_i)/r_i$. Finally, the metric parameter $\epsilon_i$ measures the relative
spacing of $p_i$ with respect to its two neighboring vertices $p_{i-1}$ and $p_{i+1}$. If $F_i$
is the projection of $p_i$ on the line $[p_{i-1}, p_{i+1}]$, then the metric parameter is:
$\epsilon_i = \|F_i - p_{i+1}\|/(2 r_i) = 1 - \|F_i - p_{i-1}\|/(2 r_i)$. In other words, $\epsilon_i$ and
$1 - \epsilon_i$ are the barycentric coordinates of $F_i$ with respect to $p_{i-1}$ and $p_{i+1}$:

$$F_i = \epsilon_i p_{i-1} + (1 - \epsilon_i) p_{i+1}$$

New Algorithms for Controlling Active Contours Shape and Topology




Fig. 1. Left: The geometry of a discrete contour; definition of $\mathbf{t}_i$, $\mathbf{n}_i$, $k_i$, $\phi_i$, and $F_i$.
Right: The internal force associated with the curvature-conservative flow is proportional
to $p_i^* - p_i$.

Other definitions for the tangent and normal vectors could have been chosen.
However, our definitions of the tangent and normal vectors have the advantage of
providing a simple local shape description:

$$p_i = \epsilon_i p_{i-1} + (1 - \epsilon_i) p_{i+1} + L(r_i, \phi_i, \epsilon_i)\,\mathbf{n}_i \qquad (3)$$

where $L(r_i, \phi_i, \epsilon_i) = \frac{r_i}{\tan\phi_i}\left(-1 + \mu\sqrt{1 + 4\epsilon_i(1 - \epsilon_i)\tan^2\phi_i}\right)$ with $\mu = 1$ if $|\phi_i| < \pi/2$
and $\mu = -1$ if $|\phi_i| > \pi/2$. Equation 3 simply decomposes vertex position $p_i$ into
a tangential and normal component. The importance of this equation will be
revealed in section 3.
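
The local quantities defined above, and the decomposition of equation 3, can be checked numerically with the Python sketch below. Several points are assumptions of the sketch rather than statements from the paper: the sign of the oriented angle is chosen so that equation 3 holds with the normal obtained by rotating the tangent by +90 degrees, the elevation function uses the form $\frac{r}{\tan\phi}(-1 + \mu\sqrt{\cdot})$ derived from the circumscribed-circle geometry, and all function names are hypothetical.

```python
import math

def local_geometry(prev, cur, nxt):
    """Return (t, n, r, phi, k, eps) for vertex cur with neighbors prev and nxt."""
    cx, cy = nxt[0] - prev[0], nxt[1] - prev[1]
    r = math.hypot(cx, cy) / 2.0                 # half distance between neighbors
    t = (cx / (2 * r), cy / (2 * r))             # unit tangent
    n = (-t[1], t[0])                            # (x, y)-perp = (-y, x)
    ax, ay = cur[0] - prev[0], cur[1] - prev[1]
    bx, by = nxt[0] - cur[0], nxt[1] - cur[1]
    # Signed angle between the two segments; sign convention is an assumption.
    phi = math.atan2(ay * bx - ax * by, ax * bx + ay * by)
    k = math.sin(phi) / r                        # curvature of circumscribed circle
    s = ax * t[0] + ay * t[1]                    # distance from prev to projection F
    eps = 1.0 - s / (2 * r)                      # metric parameter
    return t, n, r, phi, k, eps

def elevation(r, phi, eps):
    """L(r, phi, eps): signed height of the vertex above the chord."""
    if abs(phi) < 1e-12:
        return 0.0
    mu = 1.0 if abs(phi) <= math.pi / 2 else -1.0
    tan_phi = math.tan(phi)
    root = math.sqrt(1.0 + 4.0 * eps * (1.0 - eps) * tan_phi ** 2)
    return (r / tan_phi) * (-1.0 + mu * root)

def rebuild_vertex(prev, nxt, n, r, phi, eps):
    """Right-hand side of equation 3."""
    height = elevation(r, phi, eps)
    return (eps * prev[0] + (1 - eps) * nxt[0] + height * n[0],
            eps * prev[1] + (1 - eps) * nxt[1] + height * n[1])
```

Rebuilding a vertex from its own `(r, phi, eps)` should return the original position, which is a convenient sanity check before using `elevation` inside a force computation.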

3 Parameterization control 

For a continuous active contour $C(u, t)$, the contour parameterization is
characterized by the metric function: $g(u, t) = \|\partial C/\partial u\|$. If $g(u, t) = 1$ then the
parameter of $C(u, t)$ coincides with the contour arc length. For a discrete contour,
the parameterization corresponds to the relative spacing between vertices and is
characterized by $g_i = \|p_i - p_{i-1}\|$.

For a continuous representation, parameterization is clearly independent of
the contour shape. For a discrete contour represented by finite differences, shape
and parameterization are not completely independent. The effect of
parameterization changes is especially important at parts of high curvature. Therefore,
parameterization is an important issue for handling discrete parametric
contours. In this section we propose a simple algorithm to enforce two types of
parameterization:

1. uniform parameterization: the spacing between consecutive vertices is 
uniform. 

2. curvature-based parameterization: vertices are concentrated at parts of 
high curvature. This parameterization tends to optimize the shape descrip- 
tion for a given number of vertices. 



To modify a contour parameterization, only the tangential component of
the internal force should be considered. Indeed, Kimia et al. [9] have proved that
only the normal component of the internal force applied on a continuous contour
$C(u, t)$ has an influence on the resulting contour shape. Therefore, if $\mathbf{t}$, $\mathbf{n}$ are the
tangent and normal vectors at a point $C(u, t)$, then the contour evolution may
be written as: $\frac{\partial C}{\partial t} = f_{int} = a(u, t)\,\mathbf{t} + b(u, t)\,\mathbf{n}$. Kimia et al. [9] show that only
the normal component of the internal force $b(u, t)$ modifies the contour shape
whereas the evolution of the metric function $g(u, t) = \|\partial C/\partial u\|$ is dependent on
$a(u, t)$ and $b(u, t)$:

$$\frac{\partial g}{\partial t} = \frac{\partial a(u, t)}{\partial u} - b(u, t)\,k\,g \qquad (4)$$

The tangential component of the internal force $a(u, t)\,\mathbf{t}$ constrains the nature
of the parameterization. We propose to apply this principle on discrete
parametric contours as well by decomposing the internal force $f_{int}$ into its normal
and tangential components: $(f_{int})_i = (f_{tg})_i + (f_{nr})_i$ with $(f_{tg})_i \cdot \mathbf{n}_i = 0$, and
$(f_{nr})_i \cdot \mathbf{t}_i = 0$. More precisely, since the tangent direction $\mathbf{t}_i$ at a vertex is the
line direction joining its two neighbors, we use a simple expression for the
tangential component: $(f_{tg})_i = (\epsilon_i^* - \epsilon_i)(p_{i+1} - p_{i-1}) = 2 r_i (\epsilon_i^* - \epsilon_i)\,\mathbf{t}_i$ where $\epsilon_i^*$ is the
reference metric parameter whose value depends on the type of parameterization
to enforce.



3.1 Uniform vertex spacing 

To obtain evenly spaced vertices, we simply choose $\epsilon_i^* = \frac{1}{2}$. This tangential
force moves each vertex in the tangent direction towards the middle of its two
neighbors. When the contour reaches its equilibrium, i.e. when $(f_{tg})_i = 0$, $p_i$
is then equidistant from $p_{i-1}$ and $p_{i+1}$. This force corresponds to the tangential
component of the second derivative of the contour, $\left(\frac{\partial^2 C}{\partial u^2} \cdot \mathbf{t}_i\right)\mathbf{t}_i$.
Because the second derivative vector is the first variation of the weak
string internal energy ($\int \|\partial C/\partial u\|^2\,du$), this force is somewhat related to the classical
“snakes” approach proposed in [8].
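
The effect of the uniform-spacing force can be illustrated with a small Python experiment on a closed polygon. This is a sketch under assumptions: the tangential force is computed as the displacement from the projection $F_i$ to the target point $F_i^*$ on the chord (with $F_i = \epsilon_i p_{i-1} + (1 - \epsilon_i) p_{i+1}$ this equals $(\epsilon_i - \epsilon^*)(p_{i+1} - p_{i-1})$, so the sign convention is a choice of the sketch), no external or normal force is applied, and the step size `alpha`, the test contour and the function names are hypothetical.

```python
import math

def tangential_force(prev, cur, nxt, eps_ref=0.5):
    """Displacement from the projection F of cur to the reference point F* on the chord."""
    cx, cy = nxt[0] - prev[0], nxt[1] - prev[1]
    u = ((cur[0] - prev[0]) * cx + (cur[1] - prev[1]) * cy) / (cx * cx + cy * cy)
    eps = 1.0 - u                                   # metric parameter of cur
    return ((eps - eps_ref) * cx, (eps - eps_ref) * cy)

def relax(points, alpha=0.5, iterations=2000):
    """Apply only the tangential force, updating all vertices simultaneously."""
    pts = list(points)
    n = len(pts)
    for _ in range(iterations):
        forces = [tangential_force(pts[i - 1], pts[i], pts[(i + 1) % n]) for i in range(n)]
        pts = [(p[0] + alpha * f[0], p[1] + alpha * f[1]) for p, f in zip(pts, forces)]
    return pts

def edge_length_ratio(points):
    """Longest edge divided by shortest edge of a closed polygon."""
    n = len(points)
    lengths = [math.hypot(points[(i + 1) % n][0] - points[i][0],
                          points[(i + 1) % n][1] - points[i][1]) for i in range(n)]
    return max(lengths) / min(lengths)

# Unit circle sampled with deliberately uneven angular spacing.
n_vertices = 40
angles = [2 * math.pi * j / n_vertices for j in range(n_vertices)]
uneven = [(math.cos(u + 0.5 * math.sin(u)), math.sin(u + 0.5 * math.sin(u))) for u in angles]
before = edge_length_ratio(uneven)
after = edge_length_ratio(relax(uneven))
print(before, after)
```

The ratio of longest to shortest edge should drop towards 1 as the vertices spread out along the contour.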

3.2 Curvature based vertex spacing 

To obtain an optimal description of shape, it is required that vertices concentrate
at parts of high curvature and that flat parts are only described with few vertices.
To obtain such a parameterization, we present a method where edge length is
inversely proportional to curvature. If $e_i$ is the edge joining $p_i$ and $p_{i+1}$, then
we compute its edge curvature $K_i$ as the mean absolute curvature of its two
vertices: $K_i = (|k_i| + |k_{i+1}|)/2$. Then at each vertex $p_i$, we can compute the
local relative variation of absolute curvature $\Delta K_i \in [-1, 1]$ as:
$\Delta K_i = (K_i - K_{i-1})/(K_i + K_{i-1})$. To enforce a curvature-based vertex spacing, we compute
the reference metric parameter $\epsilon_i^*$ as: $\epsilon_i^* = \frac{1}{2} - 0.4\,\Delta K_i$.

When vertex $p_i$ is surrounded by two edges having the same curvature then
$\Delta K_i = 0$ and therefore $\epsilon_i^*$ is set to $\frac{1}{2}$, which implies that $p_i$ becomes equidistant
from its two neighboring vertices. On the contrary, when the absolute curvature
of $p_{i+1}$ is greater than the absolute curvature of $p_{i-1}$ then $\Delta K_i$ becomes close
to 1 and therefore $\epsilon_i^*$ is close to 0.1, which implies that $p_i$ moves towards $p_{i+1}$.
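
A minimal Python sketch of the curvature-based reference metric parameter follows. It assumes a closed contour whose vertex curvatures are given as a list, and it returns 1/2 when both neighboring edges are flat (a guard against division by zero that is not discussed in the text). The function name and the `gain` parameter are hypothetical.

```python
def reference_metric(curvatures, i, gain=0.4):
    """eps* = 1/2 - gain * dK for vertex i of a closed contour."""
    n = len(curvatures)
    k_prev = abs(curvatures[i - 1])
    k_cur = abs(curvatures[i])
    k_next = abs(curvatures[(i + 1) % n])
    edge_next = (k_cur + k_next) / 2.0      # K_i, edge joining p_i and p_{i+1}
    edge_prev = (k_prev + k_cur) / 2.0      # K_{i-1}, edge joining p_{i-1} and p_i
    total = edge_next + edge_prev
    delta = 0.0 if total == 0.0 else (edge_next - edge_prev) / total
    return 0.5 - gain * delta
```

With the gain of 0.4 the result always lies in [0.1, 0.9], so the target point stays strictly between the two neighbors.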

3.3 Results of vertex spacing constraints 

To illustrate the ability to decouple parameterization and shape properties, we
propose to apply an internal force that modifies the vertex spacing on a contour
without changing its shape. We define a curvature conservative regularizing force
that moves $p_i$ in the normal direction in order to keep the same local curvature:

$$(f_{nr})_i = \left(L(r_i, \phi_i, \epsilon_i^*) - L(r_i, \phi_i, \epsilon_i)\right)\mathbf{n}_i \qquad (5)$$

This equation has a simple geometric interpretation if we note that the total
internal force $f_{int} = f_{tg} + f_{nr}$ is simply equal to $p_i^* - p_i$ where $p_i^*$ is the point
having the same curvature as $p_i$ but with a metric parameter $\epsilon_i^*$. From the right of
figure 1, we can see that $f_{tg}$ corresponds to the displacement between $F_i^*$ and
$F_i$ whereas $f_{nr}$ corresponds to the difference of elevation between $p_i^*$ and $p_i$.

Given an open or closed contour we iteratively apply differential equation 2 
with the internal force expression described above. Figure 2 shows an example of 
vertex spacing constraint enforced on a closed contour consisting of 150 vertices. 
The initial vertex spacing is uneven. When applying the uniform vertex spacing 
tangential force ($\epsilon^* = 0.5$), after 1000 iterations, all contour edge lengths become
equal within less than 5 percent without greatly changing the contour shape, as shown
in figure 2 (upper row). The diagram displays the distribution of edge curvature
as a function of edge length. Similarly, with the same number of iterations, the
contour evolution using the curvature-based vertex spacing force tends to
concentrate vertices at parts of high curvature. The corresponding diagram clearly
shows that edge length is inversely proportional to edge curvature. 

3.4 Contour resolution control 

In addition to constraining the relative spacing between vertices, it is important 
to control the total number of vertices. Indeed, the computational complexity of 
discrete parametric contours is typically linear in the number of vertices. In order 
to add or remove vertices, we do not use any global contour reparameterization 
as performed in the level-set method [19] because of its high computational cost. 
Instead, we propose to locally add or remove a vertex if the edge length does not
belong to a given distance range, similarly to [7, 12]. Our resolution constraint
algorithm proceeds as follows. Given two thresholds $s_{min}$ and $s_{max}$ corresponding
to the minimum and maximum edge length, we scan all existing contour edges. If
the current edge length is greater than $s_{max}$ and $2 s_{min}$ then a vertex is added.
Otherwise if the current edge length is less than $s_{min}$ and if the sum of the current
and previous edge length is less than $s_{max}$, then the current vertex is removed.
In general, this procedure is called every $fr_{resolution} = 5$ deformation iterations.

Fig. 2. (top) contour after applying the uniform vertex spacing tangential force;
(bottom) contour after applying the curvature-based vertex spacing constraint.
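
One scan of this resolution-control rule might look as follows in Python. The text leaves some details open, so the sketch makes explicit assumptions: the contour is closed, a new vertex is inserted at the midpoint of an over-long edge, the "current vertex" is the start point of the current edge, and removal stops before the contour drops below three vertices. The function name is hypothetical.

```python
import math

def control_resolution(points, s_min, s_max):
    """One scan over the edges of a closed contour given as a list of (x, y) tuples."""
    n = len(points)
    kept = []
    remaining = n
    for i in range(n):
        cur = points[i]
        nxt = points[(i + 1) % n]
        prev = kept[-1] if kept else points[i - 1]
        cur_len = math.hypot(nxt[0] - cur[0], nxt[1] - cur[1])
        prev_len = math.hypot(cur[0] - prev[0], cur[1] - prev[1])
        if cur_len > s_max and cur_len > 2 * s_min:
            # Edge too long: keep the vertex and add one at the edge midpoint.
            kept.append(cur)
            kept.append(((cur[0] + nxt[0]) / 2.0, (cur[1] + nxt[1]) / 2.0))
        elif cur_len < s_min and prev_len + cur_len < s_max and remaining > 3:
            # Edge too short and the merged edge stays below s_max: drop the vertex.
            remaining -= 1
        else:
            kept.append(cur)
    return kept
```

Requiring the edge to exceed $2 s_{min}$ before splitting guarantees that the two halves are not immediately candidates for removal.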

4 Shape regularization 

The two internal forces defined in the previous section have little influence on the
contour shape evolution because they are only related to the contour
parameterization. In this section, we deal with the normal component of the internal force,
which determines the contour shape regularization. The most widely used
internal forces on active contours are the mean curvature motion [13], Laplacian
smoothing, thin rod smoothing or spring forces.

Laplacian Smoothing and Mean Curvature Motion have the drawback of sig- 
nificantly shrinking the contour. This shrinking effect introduces a bias in the 
contour deformation since image structures located inside the contour are more 
likely to be segmented than structures located outside the contour. Furthermore, 
the amount of shrinking often prevents active contours from entering inside fine 
structures. 

To reduce the shrinking effect of Gaussian smoothing, Taubin [20] proposes to
apply a linear filter to curves and surfaces. However, these two methods only
remove the shrinking effect for a
given curvature scale. For instance, when smoothing a circle, this circle would
stay invariant only for one given circle radius which is related to a set of
filtering parameters. Therefore, in these methods, the choice of these parameters is
important but difficult to estimate prior to the segmentation process. A
regularizing force with higher degrees of smoothness such as the Thin Rod Smoothing
causes significantly less shrinking since it is based on fourth derivatives along the
contour. However, the normal component of this force $-\left(\frac{\partial^4 C}{\partial u^4} \cdot \mathbf{n}\right)\mathbf{n}$ is dependent
on the nature of the parameterization, which is a serious limitation.



4.1 Curvature diffusion regularization 



We propose to use the second derivative of curvature with respect to arc length
as the governing regularizing force:

$$f_{int} = \frac{d^2 k}{ds^2}\,\mathbf{n} \qquad (6)$$



This force tends to diffuse the curvature along the contour, thus converging
towards circles, independently of their radii, for closed contours. For the
discretization of equation 6, we do not use straightforward finite differences, since
it would lead to complex and potentially unstable schemes. Instead, we propose
a geometry-based implementation that is similar to equation 5:

$$(f_{nr})_i = \left(L(r_i, \phi_i^*, \epsilon_i^*) - L(r_i, \phi_i, \epsilon_i)\right)\mathbf{n}_i \qquad (7)$$

where $\phi_i^*$ is the angle at a point $p_i^*$ for which $\frac{d^2 k}{ds^2} = 0$. The geometric
interpretation of equation 7 is also straightforward, since the internal force $f_{int} = f_{tg} + f_{nr}$
corresponds to the displacement $p_i^* - p_i$. The angle $\phi_i^*$ is simply computed by
$\phi_i^* = \arcsin(k_i^* r_i)$ where $k_i^*$ is the curvature at $p_i^*$. Therefore, $k_i^*$ is simply
computed as the local average curvature weighted by arc length:

$$k_i^* = \frac{\|p_i p_{i-1}\|\,k_{i+1} + \|p_i p_{i+1}\|\,k_{i-1}}{\|p_i p_{i+1}\| + \|p_i p_{i-1}\|}$$

Furthermore, we can compute the local average curvature over a greater
neighborhood which results in increased smoothness and faster convergence. If $\sigma_i > 0$
is the scale parameter, and $l_{i,i+j}$ is the distance between $p_i$ and $p_{i+j}$ then we
compute $k_i^*$ as:

$$k_i^* = \frac{\sum_{j=1}^{\sigma_i}\left(l_{i,i-j}\,k_{i+j} + l_{i,i+j}\,k_{i-j}\right)}{\sum_{j=1}^{\sigma_i}\left(l_{i,i-j} + l_{i,i+j}\right)} \quad \text{with} \quad l_{i,i+j} = \sum_{k=1}^{j} \|p_{i+k-1}\,p_{i+k}\|$$
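
The arc-length-weighted curvature average can be sketched in Python as below. The sketch assumes a closed contour, takes the distances $l_{i,i \pm j}$ as cumulative edge lengths along the contour, and clamps the argument of the arcsine to [-1, 1] (a numerical guard not mentioned in the text). The function names are hypothetical.

```python
import math

def smoothed_curvature(points, curvatures, i, sigma=1):
    """Arc-length-weighted average of the curvatures of the 2 * sigma neighbors of vertex i."""
    n = len(points)

    def edge(a, b):
        pa, pb = points[a % n], points[b % n]
        return math.hypot(pb[0] - pa[0], pb[1] - pa[1])

    num = den = 0.0
    l_fwd = l_bwd = 0.0
    for j in range(1, sigma + 1):
        l_fwd += edge(i + j - 1, i + j)      # distance along the contour to p_{i+j}
        l_bwd += edge(i - j + 1, i - j)      # distance along the contour to p_{i-j}
        num += l_bwd * curvatures[(i + j) % n] + l_fwd * curvatures[(i - j) % n]
        den += l_bwd + l_fwd
    return num / den

def target_angle(k_star, r):
    """phi* = arcsin(k* r), clamped to the domain of arcsin."""
    return math.asin(max(-1.0, min(1.0, k_star * r)))
```

Each neighbor is weighted by the distance to the opposite neighbor, so a curvature profile that varies linearly with arc length is left unchanged, which is the discrete counterpart of a vanishing second derivative of curvature.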



This scheme generalizes the intrinsic polynomial stabilizers proposed in [5] that
required a uniform contour parameterization. Because this regularizing force is
geometrically intrinsic, we can combine it with a curvature-based vertex spacing
tangential force, thus leading to optimized computations. Finally, the stability
analysis of the explicit integration scheme is linked to the choice of $\alpha_i$. We have
found experimentally, without having a formal proof yet, that we obtain a stable
iterative scheme if we choose $\alpha_i < 0.5$.

5 Topology constraints

Automatic topology changes of parametric contours have been previously proposed
in [11, 14, 10]. In McInerney et al.'s approach, all topological changes occur by
computing the contour intersections with a simplicial decomposition of space. The
contour is reparameterized at each iteration, the intersections with the simplicial
domain being used as the new vertices. Recently Lachaud et al. [10] introduced
topologically adaptive deformable surfaces where self-intersections are detected
based on distance between vertices. Our algorithm is also based on a regular
lattice for detecting all contour intersections. However, the regular grid is not used
for changing the contour parameterization and furthermore topology changes
result from the application of topological operators. Therefore, unlike previous
approaches, we propose to completely decouple the physical behavior of active
contours (contour resolution and geometric regularity) from their topological
behavior in order to provide a very flexible scheme. Finally, our framework applies
to closed or open contours.

A contour topology is defined by the number of its connected components
and whether each of its components is closed or open. Our approach consists
in using two basic topological operators. The first operator, illustrated in figure 3,
consists in merging two contour edges. Depending on whether the edges belong
to the same connected component or not, this operator creates or removes a
connected component. The second topological operator consists in closing or
opening a connected component.




Fig. 3. Topological operator applied to (left) two edges on the same connected
component or (right) two different connected components.



Our approach for modifying a contour topology can be decomposed into three 
stages. The first stage creates a data structure where the collision detection 
between contour connected components is computationally efficient. The second 
determines the geometric intersection between edges and the last stage actually 
performs all topological modifications. 

5.1 Data structure for the detection of contour intersections 

Finding pairs of intersecting edges has an a priori complexity of $O(n^2)$, where n
is the number of vertices (or edges). Our algorithm is based on a regular grid of
size d and has a complexity linear in the ratio $L/d$, where $L$ is the length of the
contour. Therefore, unlike the approach proposed in [14], our approach is not
region-based (inside or outside regions) but only uses the polygonal description
of the contour.

The two-dimensional Euclidean space with a reference frame (o, x, y) is
decomposed into a regular square grid whose size d is user-defined. The influence
of the grid size d is discussed in section 6.1. In this regular lattice, we define
a point of row and column indices r and c as the point of Cartesian coordinates
$O_{grid} + r\,x + c\,y$, where $O_{grid}$ is the grid origin point. This point is randomly
determined each time topology constraints are activated in order to make the
algorithm independent of the origin choice. Furthermore, we define a square cell of
index (r, c) as the square determined by the four points of indices (r, c), (r+1, c),
(r+1, c+1) and (r, c+1).

In order to build the sampled contour, we scan all edges of each connected
component. For each edge, we test whether it intersects any row or column of the
regular lattice. Since the row and column directions correspond to the directions
x and y of the coordinate frame, these intersection tests are efficiently computed.
Each time an intersection with the row or column direction is found, a grid
vertex is created and the intersecting contour edge is stored in the grid vertex.
Furthermore, a grid vertex is stored in a grid edge structure. A grid edge is either
a pair of grid vertices or a grid vertex associated with an end vertex (when the
connected component is an open line). Finally, the grid edge is appended to
the list of grid edges inside the corresponding grid cell.
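
The sampling step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`edge_crossings`, `build_grid`), the cell indexing by floor division, and the representation of a grid vertex as a (point, contour edge index) pair are all hypothetical, and only closed contours are handled.

```python
import math
from collections import defaultdict


def edge_crossings(p, q, origin, d):
    """Sorted (t, point) crossings of segment p->q with the grid lines."""
    out = []
    for axis in (0, 1):
        a, b = p[axis] - origin[axis], q[axis] - origin[axis]
        if a == b:
            continue  # segment parallel to this family of grid lines
        k = math.ceil(min(a, b) / d)
        while k * d <= max(a, b):
            t = (k * d - a) / (b - a)
            if 0.0 < t <= 1.0:  # half-open, so a shared vertex is counted once
                out.append((t, (p[0] + t * (q[0] - p[0]),
                                p[1] + t * (q[1] - p[1]))))
            k += 1
    return sorted(out)


def build_grid(contour, d, origin=(0.0, 0.0)):
    """Map cell (r, c) -> list of grid edges for a closed polygonal contour."""
    n = len(contour)
    verts = []  # grid vertices: (point, index of the contour edge crossed)
    for i in range(n):
        for _, pt in edge_crossings(contour[i], contour[(i + 1) % n], origin, d):
            verts.append((pt, i))
    cells = defaultdict(list)
    m = len(verts)
    for j in range(m):
        a, b = verts[j], verts[(j + 1) % m]
        # Pick a point on the contour just after grid vertex a: it lies in the
        # cell traversed between a and b (assumed convention for this sketch).
        same_edge = a[1] == b[1] and j + 1 < m
        nxt = b[0] if same_edge else contour[(a[1] + 1) % n]
        sx, sy = (a[0][0] + nxt[0]) / 2.0, (a[0][1] + nxt[1]) / 2.0
        cell = (math.floor((sx - origin[0]) / d),
                math.floor((sy - origin[1]) / d))
        cells[cell].append((a, b))
    return cells
```

Degenerate configurations (a contour vertex lying exactly on a grid line, or a crossing at a grid corner) are not treated specially here; the random grid origin mentioned above makes them unlikely in practice.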



Fig. 4. (left) [...] the regular grid; (right) Definition of grid vertex, grid edge,
grid cell and contour edge associated with a grid edge.

5.2 Finding intersecting grid edges 

In order to optimize memory space, we store all non-empty grid cells inside a
hash table, hashed by their row and column indices. The number of grid cells is
proportional to the length $L$ of the contour. In order to detect possible contour
intersections, each entry of the hash table is scanned. For each cell containing
n grid edges with n > 1, we test the intersection between all pairs of grid edges
(see figure 4, left). Since each grid edge is geometrically represented by a line
segment, this intersection test only requires the evaluation of two dot products.
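
A minimal sketch of the cell scan follows. It assumes, for illustration only, that each grid edge is given as a pair of 2D endpoints and that the hash table is a Python dictionary keyed by (row, column). The intersection predicate used here is the standard orientation (signed area) test, which is a substitute for, and not necessarily identical to, the two-dot-product test mentioned in the text.

```python
from itertools import combinations


def orient(a, b, c):
    """Twice the signed area of triangle (a, b, c)."""
    return (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])


def segments_intersect(p1, p2, q1, q2):
    """True when the two segments properly cross (touching cases excluded)."""
    d1, d2 = orient(q1, q2, p1), orient(q1, q2, p2)
    d3, d4 = orient(p1, p2, q1), orient(p1, p2, q2)
    return d1 * d2 < 0 and d3 * d4 < 0


def intersecting_pairs(cells):
    """Scan every non-empty cell and test all pairs of grid edges inside it."""
    hits = []
    for cell, edges in cells.items():
        if len(edges) < 2:
            continue  # a single grid edge cannot intersect anything
        for e, f in combinations(edges, 2):
            if segments_intersect(e[0], e[1], f[0], f[1]):
                hits.append((cell, e, f))
    return hits
```

Because only edges sharing a cell are compared, the quadratic pairwise test is confined to small lists rather than applied to the whole contour.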

Once a pair of grid edges has been found to intersect, a pair of contour edges 
must be associated for the application of topological operators (see section 5.3). 
Because a contour edge is stored in each grid vertex, one contour edge can be 
associated with each grid edge. Thus, we associate with each grid edge, the 
middle of these two contour edges (in terms of topological distance) as shown in 
figure 4, right.

Our contour edge intersection algorithm has the following properties: (i) if
one pair of grid edges intersects then there is at least one pair of contour edges
that intersects inside this grid cell; and (ii) if a pair of contour edges intersects
and if the corresponding intersecting area is greater than $d \times d$ then there is a
corresponding pair of intersecting grid edges. In other words, our method does
not detect all intersections but is guaranteed to detect all intersections having
an area greater than $d \times d$. In practice, since the grid origin $O_{grid}$ is randomly
determined each time the topology constraint is enforced, we found that our
algorithm detected all intersections that are relevant for performing topology
changes.

5.3 Applying topological operators 

All pairs of intersecting contour edges are stored inside another hash table for 
an efficient retrieval. Since in general two connected components intersect each 
other at two edges, given a pair of intersecting contour edges, we search for the 
closest pair of intersecting contour edges based on topological distance. If such 
a pair is found, we perform the following tasks. If both edges belong to the same 
connected component, then they are merged if their topological distance
is greater than a threshold (usually equal to 8). This is to avoid creating too 
small connected components. In all other cases, the two edges are merged with 
the topological operator presented in figure 3. Finally, we update the list of 
intersecting edge pairs by removing from the hash table all edge pairs involving 
any of the two contour edges that have been merged. 
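
The merging rule above can be illustrated with a small sketch operating on vertex index lists. The exact reconnection performed by the operator of figure 3 is not specified in the text, so the splicing below is an assumption, as are the function names and the convention that edge i joins vertex i to vertex i+1 on a closed contour.

```python
def topological_distance(i, j, n):
    """Number of edges separating edges i and j on a closed contour of n edges."""
    k = abs(i - j)
    return min(k, n - k)


def split_component(vertices, i, j, min_dist=8):
    """Merge edges i and j of one closed component, creating a second one.

    The merge is skipped when the edges are too close, to avoid creating a
    very small component (threshold taken from the text, usually 8).
    """
    n = len(vertices)
    if topological_distance(i, j, n) <= min_dist:
        return [vertices]
    i, j = sorted((i, j))
    inner = vertices[i + 1:j + 1]               # v[i+1] .. v[j]
    outer = vertices[j + 1:] + vertices[:i + 1]  # v[j+1] .. v[n-1], v[0] .. v[i]
    return [outer, inner]


def merge_components(a, i, b, j):
    """Merge edge i of closed component a with edge j of closed component b."""
    # Assumes both components are traversed with compatible orientations.
    return a[:i + 1] + b[j + 1:] + b[:j + 1] + a[i + 1:]
```

For example, `split_component(list(range(20)), 2, 12)` yields two components of ten vertices each, while `split_component(list(range(20)), 2, 5)` leaves the contour unchanged because the edges are only three steps apart.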

5.4 Other applications of the collision detection algorithm 

The algorithm presented in the previous sections merges intersecting edges re- 
gardless of the nature of the intersection. If it corresponds to a self-intersection, 
then a new connected component is created, otherwise two connected compo- 
nents are merged. As in [14] our framework can prevent the merging of two 
distinct connected components while allowing the removal of self-intersections. 
To do so, when a pair of intersecting contour edges belonging to distinct connected
components is found, instead of merging these edges, we align all vertices
located between intersecting edges belonging to the same connected component 
(see figure 5). Thus, each component pushes back all neighboring components. 
In figure 5, right, we show an example of image segmentation where this repul- 
sive behavior between components is very useful in segmenting the two heart 
atriums. 

Fig. 5. (from left to right) Two intersecting connected components; after merging the
two pairs of intersecting edges; after aligning vertices along the intersecting edges;
example of two active contours reconstructing the right and left ventricles in an MR
image with a repulsive behavior between the components.

6 Results

6.1 Topology algorithm cost

We evaluate the performance of our automatic topology adaptation algorithm
on the example of figure 6. The contour, consisting of 50 vertices, is deformed
from a circular shape towards a vertebra in a CT image. The computation time
for building the data structure described in section 5.1 is displayed in figure 6,
right, as a function of the grid size d. It varies from 175 ms to 1 ms when
the grid size increases from 0.17 to 10 image pixels on a Digital PWS 500 MHz.
The computation time for applying the topological operators can be neglected in
general. When the grid size is equal to the mean edge distance (around 2 pixels),
the computation time needed to detect edge intersections becomes almost equal
to the computation time needed to deform the contour during one iteration (4.8
ms).





Fig. 6. (left) Segmentation of a vertebra in a CT image; (right) topology algorithm
computation time.

When the grid size increases, the contour sampling on the regular grid be- 
comes sparse and therefore some contour intersections may not be detected. 
However, we have verified that topological changes still occur if we choose a grid 
size corresponding to 20 image pixels with contour intersections checked every 20 
iterations. In practice, we choose a conservative option with a grid size equal to 
the average edge length and with a frequency for topology changes of 5 iterations 
which implies an approximate additional computation time of 20 percent. 
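
As a rough check of this figure, assume (from the timings reported above) that one topology pass costs about as much as one deformation iteration, $T_{topo} \approx T_{iter} \approx 4.8$ ms, and that the pass runs once every $N = 5$ iterations. The relative overhead is then

$$\frac{T_{topo}}{N \, T_{iter}} \approx \frac{4.8}{5 \times 4.8} = 0.2 ,$$

that is, about 20 percent.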

6.2 Segmentation example 

This example illustrates the segmentation of an aortic arch angiography. Figure 7 
shows the initial contour (top left) and its evolution towards the aorta and the
main vessels. External forces are computed as a function of vertex distance to
a gradient point to avoid oscillations around image edges and are projected on 
the vertex normal direction. The contour is regularized by a curvature diffusive 
constraint. The contour resolution constraint is applied every 10 iterations which 
makes the resampling overhead very low. Topology constraints are computed 
every 5 iterations on a 4 pixel grid size to fuse the self-intersecting contour 
parts. Intersections with image borders are computed every 10 iterations and 
the contour is opened as it reaches the image border. 




Fig. 7. Evolution of a closed curve towards the aortic arch and the branching vessels. 



7 Comparison with the level-set method 

The main advantage of the level-set method is obviously its ability to automatically
change the contour topology during the deformation. This property makes
it well-suited for reconstructing contours of complex geometries, for instance
tree-like structures. Also, by merging different intersecting contours, it is possible to
initialize a deformable contour with a set of growing seeds. However, the major
drawbacks of level-set methods are related to their difficult user interaction
and their computational cost, although some speed-up algorithms based on
constraining the contour evolution through the Fast-Marching method [19] or on
using an asynchronous update of the narrow-band [18] have been proposed. A
formal comparison between the parametric and level-set approaches has been
recently established in the case of geodesic snakes [1]. In this section, we propose
a practical comparison between both approaches, including implementation
issues.

7.1 Level-set implementation 

The level-set function $\Psi$ is discretized on a rectangular grid whose resolution
corresponds to the image pixel size. The evolution equation is discretized in space
using finite differences and in time using an explicit scheme, leading to
$\Psi^{t+\Delta t}_{ij} = \Psi^{t}_{ij} + \Delta t \, \nu_{ij} \, \| \nabla_{ij} \Psi^{t}_{ij} \|$ [13], where $\nu_{ij}$ denotes the propagation
speed term and $\Delta t$ is the discrete time step.

The propagation speed term $\nu$ is designed to attract the contour C towards object
boundaries extracted from the image using a gradient operator with an additional
regularizing term: $\nu(p) = \beta(p) \, (\kappa(p) + c)$. $\kappa(p)$ denotes the contour curvature at
point p while c is a constant resulting in a balloon force [4] on the contour. Finally,
$\beta(p) \in [0, 1]$ is a multiplicative coefficient dependent on the image gradient
norm at point p. When C moves across pixels of high gradients, this term slows
down the level-set propagation. A threshold parameter determines the minimal
image boundary strength required to stop a level-set evolution. We speed up
the level-set evolution using a narrow-band method [13], which requires periodically
reinitializing the level-set contour.

Fig. 8. (Five left figures) Discrete contour deformation; (Five right figures) level-set
deformation.
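
The explicit update and the speed term can be sketched on a small grid as follows. This is an illustrative sketch only: it uses central differences for the gradient norm (practical level-set codes usually use upwind differences), updates every interior pixel rather than a narrow band, and leaves border pixels unchanged. The function names and the sign convention of the update are assumptions.

```python
import math


def speed(beta, kappa, c):
    """Propagation speed nu = beta * (kappa + c), as defined in the text."""
    return beta * (kappa + c)


def level_set_step(psi, nu, dt):
    """One explicit time step: psi <- psi + dt * nu * |grad psi|.

    psi and nu are lists of rows of floats with the same shape.
    """
    rows, cols = len(psi), len(psi[0])
    new = [row[:] for row in psi]
    for i in range(1, rows - 1):
        for j in range(1, cols - 1):
            gx = (psi[i][j + 1] - psi[i][j - 1]) / 2.0  # central difference in x
            gy = (psi[i + 1][j] - psi[i - 1][j]) / 2.0  # central difference in y
            new[i][j] = psi[i][j] + dt * nu[i][j] * math.hypot(gx, gy)
    return new
```

With a planar function such as psi[i][j] = j, whose gradient norm is 1 everywhere, a unit speed and a 0.3 time step simply shift every interior value by 0.3.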

In order to compare active contours to level-sets, we compute external forces 
similar to level-set propagation at discrete contour vertices. First, we use mean 
curvature motion as the governing internal force and, for the external force, we
have implemented a balloon force weighted by the coefficient $\beta$ defined above.
Finally, we were able to use the same gradient threshold in both approaches. 

7.2 Torus example 

We first propose to compare both approaches on the synthetic image shown in 
figure 8. This image has two distinct connected components. 

A discrete contour is initialized around the two components. A medium grid 
size (8 pixels resolution) is used and topology constraints are computed every 
10 iterations. Throughout the deformation process, vertices are added and re- 
moved to have similar edge length along the contour. A corresponding level-set 
is initialized at the same place as the discrete contour. A 7 pixel wide narrow
band appeared to optimize the convergence time. A 0.3 time step is used. It is 
the maximal value below which the evolving curve is stable. 

Figure 8 shows the convergence of the discrete contour (left) and the level-set 
(right). The discrete contour converges in 0.42 seconds as opposed to 3.30 seconds
for the level-set, that is, a 7.85 acceleration factor in favor of the discrete contour.
The difference in computational time is due to the small vertex number used for
the discrete contour (varying between 36 and 48 vertices) compared to the much 
greater number of sites (from 1709 up to 3710) updated in the level-set narrow 
band. 

7.3 Synthetic data 

This experiment shows the ability of the discrete contour topology algorithm
to follow difficult topology changes. We use a synthetic fractal image showing a 
number of small connected components. Figure 9, upper row, shows the discrete 
contour convergence in the image while the bottom row shows the level set 
convergence. 

In both cases, the initial contour is a square located at the image border. It
evolves under a deflation force that stops on strong image boundaries. For the
discrete contour, a small grid size is used due to the small image structure size
(4 pixels grid size and 5 iterations algorithm frequency). A weak regularizing
constraint allows the contour to segment the square corners. The contour is
checked every 10 iterations to add the necessary vertices. A 0.3 time step is used
for the level-set. This high value leads to a rather unstable behavior, as can be
seen in figure 9. As the level-set contour gradually fills in the whole image, we
have verified that the convergence time is not reduced by using a narrow
band. Again, the speed-up is 3.84 in favor of the discrete contour.

Fig. 9. (upper row) Discrete contour convergence in a fractal image; (bottom row)
Level-set convergence in the same image.

8 Conclusion 

We have introduced three algorithms that greatly improve the generality of para- 
metric active contours while preserving their computational efficiency. Further- 
more, these algorithms are controlled by simple parameters that are easy to 
understand. For the internal force, a single parameter $\alpha$ between 0 and 1 is used
to set the amount of smoothing. For resolution and topology constraint algo- 
rithms, distance parameters must be provided as well as the frequency at which 
they apply. Given an image, all these parameters can be set automatically to 
meaningful values providing good results in most cases. 

Finally, we have compared the efficiency of this approach with the level set 
method by implementing parametric geodesic snakes. These experiments suggest
that our approach is at least three times as fast as the implicit
implementation. Above all, we believe that the most important advantage of 
parametric active contours is their user interactivity.



Motion Segmentation by Tracking Edge Information over Multiple Frames

Abstract. This paper presents a new Bayesian framework for layered
motion segmentation, dividing the frames of an image sequence into
foreground and background layers by tracking edges. The first frame in the
sequence is segmented into regions using image edges, which are tracked
to estimate two affine motions. The probability of the edges fitting each
motion is calculated using 1st order statistics along the edge. The most
likely region labelling is then resolved using these probabilities, together
with a Markov Random Field prior. As part of this process one of the
motions is also identified as the foreground motion.

Good results are obtained using only two frames for segmentation. However,
it is also demonstrated that over multiple frames the probabilities
may be accumulated to provide an even more accurate and robust
segmentation. The final region labelling can be used, together with the two
motion models, to produce a good segmentation of an extended sequence.

D. Vernon (Ed.): ECCV 2000, LNCS 1843, pp. 396-410, 2000.
(c) Springer-Verlag Berlin Heidelberg 2000



1 Introduction

Video segmentation is a first stage in many further areas of video analysis. For
example, there is growing interest in video indexing - where image sequences
are indexed and retrieved by their content - and semantic analysis of an image
sequence requires moving objects to be distinguished from the background.
Further, the emerging MPEG-4 standard represents sequences as objects on a series
of layers, and so these objects and layers must be identified to encode a video
sequence.

A recent trend in motion segmentation is the use of layers [?]. This avoids
some of the traditional multiple-motion estimation problems by assuming that
motion within a layer is consistent, but layer boundaries mark motion
discontinuities. The motions and layers may be estimated using the recursive dominant
motion approach [?], or by fitting many layers simultaneously [?].

Motion estimation is poor in regions of low texture, and here the structure of
the image has to play a part. Smooth regions are expected to move coherently,
and changes in motion are more likely to occur at edges in the image. A common
approach is to use the local image intensity as a prior when assigning pixels to
layers [?]. The normalized cuts method of Shi and Malik [?] can combine
both the motion and intensity information of pixels into a weighted graph, for
which the best partition has then to be found.

Alternatively, the image structure may be considered before the motion es- 
timation stage by performing an initial static segmentation of the frame based 
on pixel colour or intensity. This reduces the problem to one of identifying the 
correct motion labelling for each region. Both Bergen and Meyer [2] and
Moscheni and Dufaux [?] have had some success in merging regions with similar
motion fields. 

This paper concentrates on the edges in an image. Edges are very valuable 
features to consider, both for motion estimation and segmentation. Object track- 
ing is commonly performed using edge information (in the form of snakes) , while 
image segmentation techniques naturally use the structure cues given by edges. 
If an image from a motion sequence is already segmented into regions of similar 
colour or intensity along edges, it is clear that a large proportion of the motion 
information will come from these edges rather than the interior of regions. This 
paper shows how this edge information alone is sufficient to both estimate and 
track motions, and label image regions. 

Many papers on motion segmentation avoid the question of occlusion or the 
ordering of layers. Occluded pixels are commonly treated as outliers which the 
algorithm has to be able to tolerate, although reasoned analysis and modelling 
of these outliers can be used to retrieve the layer ordering and identify occluded 
regions [?]. With the edge-based method proposed in this paper, the problem
of occluded pixels is greatly reduced since it is only the occluding boundary,
and not the region below, which is being tracked. Furthermore, the relationship 
between edges and regions inherently also depends on the layer ordering, and 
this is extracted as an integral part of the algorithm. 

This paper describes a novel and efficient framework for segmenting frames 
from a sequence into layers using edge motions. The theory linking the motions 
of edges and regions is outlined and a Bayesian probabilistic framework developed
to enable a solution for the most likely region labelling to be inferred from
edge motions. This work extends the approach first proposed in [?], developing
more powerful probabilistic models and demonstrating that evidence may be ac- 
cumulated over a sequence to provide a more accurate and robust segmentation. 

The theoretical and probabilistic framework for analysing edge motions is
presented in Sect. 2. The current implementation of this theory is outlined in
Sect. 3, with experimental results presented in Sect. 4.



2 Theoretical Framework 

Edges in the image are important features since the desired segmentation di- 
vides the image along occluding edges of the foreground object (or objects) in 
the image. Edges are also very good features to consider for motion estimation: 
they can be found more reliably than corners and their long extent means that 
a number of measurements may be taken along their length, leading to a more
accurate estimation of their motion. However, segmentation ultimately involves
regions, since the task is one of labelling image pixels according to the motions. 
If it is assumed that the image is already segmented into regions along edges, 
then there is a natural link between the regions and the edges. In this section the 
relationship between the motion of regions and edges is outlined and a proba- 
bilistic framework is developed to enable a region labelling to be estimated from 
edge data. 

2.1 The Image Motion of Region Edges 

Edges in an image are due to the texture of objects, or their boundaries in the 
scene. Edges can also be due to shadows and specular reflections, but these are 
not considered at this stage. It is assumed that as an object moves all of the edges 
associated with the object move, and hence edges in one frame may be compared 
with those in the next and partitioned according to different real-world motions. 

The work in this paper assumes that the motion in the sequence is layered 
i.e. one motion takes place completely in front of another. Typically the layer 
farthest from the camera is referred to as the background, with foreground layers 
in front of this. It is also assumed that any occluding boundary (the edge of a 
foreground object) is visible in the image. With regions in the image defined 
by the edges, this implies that each region obeys only one motion, and an edge 
which is an occluding boundary will have the motion of the occluding region. 
This enables a general rule to be stated for labelling edges from regions: 

Labelling Rule: The layer to which an edge belongs is that of the nearer 
of the two regions which it bounds. 
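
The Labelling Rule can be written as a small function. This is a sketch under assumed data structures: edges are given as a mapping from an edge identifier to the pair of regions it separates, regions map to a layer name, and the depth ordering is a list of layers from nearest to farthest. All names are hypothetical.

```python
def edge_labels(edges, region_layer, front_to_back):
    """Assign to each edge the layer of the nearer of its two regions."""
    depth = {layer: k for k, layer in enumerate(front_to_back)}  # 0 = nearest
    labels = {}
    for edge, (a, b) in edges.items():
        la, lb = region_layer[a], region_layer[b]
        labels[edge] = la if depth[la] <= depth[lb] else lb
    return labels
```

An edge between a foreground and a background region is therefore labelled as foreground, while an edge between two background regions stays in the background layer.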

2.2 Probabilistic Formulation 

There are a large number of parameters which must be solved to give a complete 
motion segmentation. In this section a Bayesian framework is developed to enable 
the most likely value of these parameters to be estimated. 

The complete model of the segmentation, $M$, consists of the elements
$M = \{\Theta, F, R\}$, where

$\Theta$ is the parameters of the layer motion models,

$F$ is the depth ordering of the motion layers,

$R$ is the motion label (layer) for each region.

The region edge labels are not part of the model, but are completely defined by
R and F from the Labelling Rule of Sect. 2.1.

Given the image data D (and any other prior information assumed about
the world), the task is to find the model M with the maximum probability given
this data and priors¹:

$$\max_{M} P(M|D) = \max_{RF\Theta} P(RF\Theta|D) . \qquad (1)$$

¹ Throughout this paper, max is used to also represent argmax, as frequently both
the maximum value and the parameters giving this are required.

This can be further decomposed into a motion estimation component and region
labelling:

$$\max_{RF\Theta} P(RF\Theta|D) = \max_{RF\Theta} P(\Theta|D) \, P(RF|\Theta D) . \qquad (2)$$

At this stage a simplification is made: it is assumed that the maximum value
(not the model parameters which give this) of the second term, $P(RF|\Theta D)$, is
independent of the motion, and thus the motion parameters $\Theta$ can be maximised
independently of the others. The expression to be maximised is thus

$$\underbrace{\max_{\Theta} P(\Theta|D)}_{a} \; \underbrace{\max_{RF} P(RF|\Theta D)}_{b} , \qquad (3)$$

where the value of $\Theta$ used in term (b) is that which maximises term (a). The
two components of (3) can be evaluated in turn: first (a) and then (b).



(a) Estimating the Motions $\Theta$. The first term in (3) estimates the motions
between frames, which may be estimated by tracking features. As outlined
in Sect. 2.1, edges are robust features to track and they also provide a natural
link to the regions which are to be labelled.

In order to estimate the motion models from the edges it is necessary to know
which edges belong to which motion, which is not something that is known a
priori. In order to resolve this, another random variable is introduced, e, which
is the labelling of an edge: which motion the edge obeys. The motion estimation
can then be expressed in terms of an Expectation-Maximisation problem [?]:

$$\begin{cases} P(e|\Theta_n D) & \text{E-stage} \\ \max_{\Theta_{n+1}} \sum_{e} P(\Theta_{n+1}|eD) \, P(e|\Theta_n D) & \text{M-stage} \end{cases} \qquad (4)$$

Starting with an initial guess of the motions, the expected edge labelling is 
estimated. This edge labelling can then be used to maximise the estimate of the 
motions, and the process iterates until convergence. 
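
The alternation between the two stages can be illustrated with a deliberately simplified toy problem. In this sketch each motion is a single 1D translation, each edge contributes one measured displacement, and errors are Gaussian with a fixed, assumed standard deviation `sigma`; none of these simplifications come from the paper, which uses affine motions and tracked edge nodes.

```python
import math


def em_two_motions(displacements, theta=(-1.0, 1.0), sigma=1.0, iters=20):
    """Toy EM: estimate two translations and the soft edge labelling."""
    t1, t2 = theta
    resp = []
    for _ in range(iters):
        # E-stage: probability that each edge obeys motion 1, given the motions.
        resp = []
        for d in displacements:
            p1 = math.exp(-0.5 * ((d - t1) / sigma) ** 2)
            p2 = math.exp(-0.5 * ((d - t2) / sigma) ** 2)
            resp.append(p1 / (p1 + p2))  # assumes p1 + p2 does not underflow
        # M-stage: re-estimate each motion, weighting edges by their labelling.
        w1 = sum(resp)
        w2 = len(resp) - w1
        t1 = sum(r * d for r, d in zip(resp, displacements)) / w1
        t2 = sum((1.0 - r) * d for r, d in zip(resp, displacements)) / w2
    return (t1, t2), resp
```

Starting from a rough initial guess, the estimates settle on the two clusters of displacements, and the returned responsibilities play the role of the edge probabilities used in the next stage.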



(b) Estimating the Labellings R and F. Having obtained the most likely
motions, the remaining parameters of the model M can be maximised. Once
again, the edge labels are used as an intermediate step. The motion estimation
allows the edge probabilities to be estimated, and from Sect. 2.1 the relationship
between edges and regions is known. Term (3b) can be augmented by the edge
labelling e, which must then be marginalised, giving

$$\max_{RF} P(RF|\Theta D) = \max_{RF} \sum_{e} P(RF|e) \, P(e|\Theta D) , \qquad (5)$$

since R and F are conditionally independent of D given e (which is entirely
defined by R and F).

The second term, the edge probabilities, can be extracted directly from the
motion estimation stage. The first term is more difficult to estimate, and it is
easier to recast this using Bayes' Rule, giving

$$P(RF|e) = \frac{P(e|RF) \, P(RF)}{P(e)} . \qquad (6)$$



The maximisation is over R and F, so P(e) is constant. It can also be assumed
that the priors of R and F are independent, and any foreground motion is equally
likely, so P(F) is constant. The last term, the prior probability of a particular
region labelling P(R), is not constant, which leaves the following expression to
be evaluated:

$$\max_{RF} \sum_{e} P(e|RF) \, P(R) \, P(e|\Theta D) . \qquad (7)$$

The P(e|RF) term is very useful: e is only an intermediate variable, and is
entirely defined by the region labelling R and the foreground motion F. This
probability therefore takes on a binary value: it is 1 if that edge labelling is
implied and 0 if it is not. The sum in (7) can thus be removed, and the e in
the final term replaced by a function of R and F which gives the correct edge
labels:

$$\max_{RF} \underbrace{P(e(R,F)|\Theta D)}_{a} \; \underbrace{P(R)}_{b} . \qquad (8)$$

The variable F takes only a discrete set of values (in the case of two layers,
only two: either one motion is foreground, or the other). Equation (8) can therefore
be maximised in two stages: F can be fixed at one value and the expression
maximised over R, and the process then repeated with other values of F and
the global maximum taken.

The maximisation over R can be performed by hypothesising a complete region
labelling and then testing the evidence (8a), calculating the probability of
the edge labelling given the regions and the motions, and the prior (8b),
calculating the likelihood of that particular labelling configuration. An exhaustive
search is impractical, and in the implementation presented here region labellings
are hypothesised using simulated annealing.
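
For a handful of regions the two-stage maximisation can be written out directly. The sketch below enumerates every labelling instead of using simulated annealing, which is only feasible for tiny examples, and it replaces the Markov Random Field prior with a toy prior that rewards neighbouring regions sharing a label. The function name, the `same_label_bonus` parameter and the data layout are assumptions made for illustration.

```python
import math
from itertools import product


def best_labelling(regions, edges, edge_prob, same_label_bonus=1.5):
    """Return (log score, foreground motion F, region labelling R).

    edges: {edge_id: (region_a, region_b)}
    edge_prob: {edge_id: probability that the edge obeys motion 0}
    """
    best = (-math.inf, None, None)
    for fg in (0, 1):  # F: which of the two motions is the foreground
        for labels in product((0, 1), repeat=len(regions)):
            R = dict(zip(regions, labels))
            log_p = 0.0
            for e, (a, b) in edges.items():
                # Labelling Rule: the edge takes the motion of the nearer region.
                e_label = fg if fg in (R[a], R[b]) else 1 - fg
                p = edge_prob[e] if e_label == 0 else 1.0 - edge_prob[e]
                log_p += math.log(max(p, 1e-12))          # evidence term
                if R[a] == R[b]:
                    log_p += math.log(same_label_bonus)   # toy smoothness prior
            if log_p > best[0]:
                best = (log_p, fg, R)
    return best
```

With three mutually adjacent regions where the two edges around region A strongly favour motion 0 and the remaining edge favours motion 1, the search selects motion 0 as foreground and labels only A with it.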



3 Implementation 

This section outlines the implementation of the framework presented in Sect. 2
for two layers (foreground and background), with the motions of each modelled
by an affine motion. The basic implementation is divided into three sections (see
Fig. 1):

1. Find edges and regions in the first frame 

2. Estimate the motions and edge probabilities 

3. Label the regions and foreground motion 

The latter two stages can then be continued over subsequent frames and the
edge probabilities accumulated.

(a) Initial static segmentation (b) Edges labelled as motion 1 or motion 2 (c) Foreground regions

Fig. 1. 'Foreman' segmentation from two frames. The foreman moves his head very
slightly to the left between frames, but this is enough to accurately estimate the motions
and calculate edge probabilities. The foreground motion can then be identified and the
regions labelled to produce a good segmentation of the head.



3.1 Finding Edges and Regions 

To implement the framework outlined in Sect. 2, regions and edges must first
be located in the image. The implementation presented here uses a scheme
developed by Sinclair [?], but other edge-based schemes, such as the morphological
segmentation used in [?], are also suitable.

Under Sinclair's scheme, colour edges are found in the image and seed points
for region growing are then found at the locations furthest from these edges.
Regions are grown, by pixel colour, with image edges acting as hard barriers. The
result is a series of connected, closed region edges generated from the original
fragmented edges (see Fig. 1(a)). The edges referred to in this paper are the
region boundaries: each boundary between two distinct regions is an edge.

3.2 Estimating the Motions $\Theta$

As described in Sect. 2.2, the problem of labelling the segmented regions can be
divided into two stages: first estimating the motions and then the motion and
region labelling. In order to estimate the motions, features are tracked from one
frame to the next; the obvious features to use are the region edges. The motion is
parameterised by a 2D affine transformation, which gives a good approximation
to the small inter-frame motions.

Multiple-motion estimation is a circular problem. If it were known which
edges belonged to which motion, these could be used to directly estimate the
motions. However, edge motion labelling cannot be performed without knowing
the motions. In order to resolve this, Expectation-Maximisation (EM) is used
[?], implementing the formulation outlined in (4) as described below.

Edge Tracking. Both stages of the EM process make use of group-constrained
snake technology [?]. For each edge, tracking nodes are assigned at regular
intervals along the edge (see Fig. 2). The motion of these nodes is considered
to be representative of the edge motion (there are around 1,400 tracking nodes
in a typical frame). The tracking nodes from the first frame are mapped into
the next according to the current best guess of the motion. A 1-dimensional
search is then made along the edge normal (for 5 pixels in either direction) to find
a matching edge pixel based on colour image gradients. The image distance d,
between the original node location and its match in the next image, is measured
(see Fig. 2(b)).

Fig. 2. Edge tracking example. (a) Edge in initial frame. (b) In the next frame the
image edge has moved. Tracking nodes are initialised along the model edge and then
search normal to the edge to find the new location. The best-fit motion is the one that
minimises the squared distance error between the tracking nodes and the edge.

At each tracking node the expected image motion due to the 2D affine motion
Θ can be calculated. The best fit solution is the one which minimises the residual

$$\min_{\Theta} \sum_{e} \sum_{t \in e} \left( d_t - n(\Theta, t) \right)^2$$

over all edges e and tracking nodes t, where $d_t$ is the measurement and $n(\Theta, t)$
the component of the image motion normal to the edge at that image location.
This expression may be minimised using least squares, although in practice an 
M-estimator (see, for example, M) is used to provide robustness to outliers. 
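
The plain least-squares version of this fit can be sketched as follows. This is a minimal illustration and not the implementation used here: it assumes the affine motion is written as a displacement field with six parameters, and the function name, the parameter ordering and the optional `weights` hook (where a robust re-weighting could be plugged in) are all hypothetical.

```python
import numpy as np

def fit_affine_from_normal_distances(points, normals, distances, weights=None):
    """Least-squares fit of a 2D affine displacement field from normal distances.

    points    : (N, 2) tracking node positions (x, y) in the first frame
    normals   : (N, 2) unit edge normals at those nodes
    distances : (N,)   measured displacement along each normal (d_t)
    weights   : (N,)   optional per-node weights (hypothetical hook, e.g. for
                       robust re-weighting); every node counts equally if None
    Returns theta = (a11, a12, a21, a22, tx, ty) such that the displacement at
    (x, y) is (a11*x + a12*y + tx, a21*x + a22*y + ty).
    """
    points = np.asarray(points, dtype=float)
    normals = np.asarray(normals, dtype=float)
    distances = np.asarray(distances, dtype=float)
    x, y = points[:, 0], points[:, 1]
    nx, ny = normals[:, 0], normals[:, 1]
    # The normal component n . v(x, y) is linear in theta: one row per node.
    design = np.column_stack([nx * x, nx * y, ny * x, ny * y, nx, ny])
    if weights is not None:
        w = np.sqrt(np.asarray(weights, dtype=float))
        design = design * w[:, None]
        distances = distances * w
    theta, *_ = np.linalg.lstsq(design, distances, rcond=None)
    return theta
```

Because each node constrains only the motion component along its normal, the fit needs edges with a spread of orientations; nodes on a single straight edge would leave the system rank-deficient.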



Maximisation: Estimating the Motions. Given the previous estimate of the
motions Θ, all tracking nodes are mapped into the next frame according to both
motions. From each of the two possible locations a normal search is performed as
described above and the best match found (or ‘no match’ is reported if none is
above a threshold). These distances are combined into the estimate of the affine
motion parameters (jOI) in proportion to the current edge probabilities.



Expectation: Calculating Edge Probabilities. For simplicity, it is assumed
that the tracking nodes along each edge are independent and that tracker errors
can be modelled by a normal distribution. Experiments have shown Gaussian
statistics to be a good fit, and although the independence assumption is less
valid (see Sec. 3.3), it still performs satisfactorily for the EM stage.

By assuming independence, the edge probability under each motion is the
product of the tracking node probabilities. Each tracking node tries to find a
match under each of the two motions, yielding either an error distance $d_i$ or
finding no match above a threshold (denoted by $d_i = \emptyset$). There are three
distinct cases when matching under the two motions: a match is found under both
motions, under neither motion, or under only one motion. The probability
distributions for each case have been modelled from data by considering an ideal
solution.

Match Found Under Both Motions. The errors under both motions are modelled
by normal distributions and, a priori, both are equally likely. The probability of
a tracker belonging to motion 1 is given by the normalised probability

$$P(\text{Motion 1} \mid d_1, d_2) = \frac{1}{1 + \exp\left(-\frac{1}{2\sigma^2}\left(d_2^2 - d_1^2\right)\right)} \tag{10}$$

where, from data, $1/2\sigma^2 = 0.3$. The probability of it belonging to motion 2 is,
of course, $1 - P(\text{Motion 1} \mid d_1, d_2)$.

Match Found Under Only One Motion. A Gaussian was found to be a good fit
to experimental data:

$$P(\text{Motion 1} \mid d_1, d_2 = \emptyset) = \alpha \exp\left(-\beta d_1^2\right) , \tag{11}$$

with $\alpha = 0.97$ and $\beta = 0.0265$. The same equation holds, but with $d_2$, if the
single match were under motion 2 instead.

No Match Found Under Either Motion. In this case, no information is available
and a uniform prior is used:

$$P(\text{Motion 1} \mid d_1 = d_2 = \emptyset) = 0.5 . \tag{12}$$
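
The three cases can be collected into one function, sketched below under stated assumptions. The constants 0.3, 0.97 and 0.0265 are the values quoted in the text; the forms used for the two-match and single-match cases are read from the garbled printed equations (a logistic in the squared residuals, oriented so that a smaller residual under motion 1 favours motion 1, and a Gaussian with amplitude α). Representing "no match" by `None`, the function names, and the normalisation of the two edge products are illustrative choices rather than details given in the text.

```python
import math

INV_TWO_SIGMA_SQ = 0.3       # 1 / (2 sigma^2), value quoted in the text
ALPHA, BETA = 0.97, 0.0265   # single-match parameters quoted in the text

def node_prob_motion1(d1, d2):
    """P(motion 1) for one tracking node; None stands for 'no match found'."""
    if d1 is None and d2 is None:
        return 0.5                                    # no information
    if d2 is None:
        return ALPHA * math.exp(-BETA * d1 ** 2)      # match only under motion 1
    if d1 is None:
        return 1.0 - ALPHA * math.exp(-BETA * d2 ** 2)  # match only under motion 2
    return 1.0 / (1.0 + math.exp(-INV_TWO_SIGMA_SQ * (d2 ** 2 - d1 ** 2)))

def edge_prob_motion1(node_residuals):
    """Combine independent node probabilities into a normalised edge probability."""
    log_p1 = log_p2 = 0.0
    for d1, d2 in node_residuals:
        p = min(max(node_prob_motion1(d1, d2), 1e-12), 1.0 - 1e-12)
        log_p1 += math.log(p)
        log_p2 += math.log(1.0 - p)
    # Normalise the two products in log space to avoid underflow on long edges.
    m = max(log_p1, log_p2)
    e1, e2 = math.exp(log_p1 - m), math.exp(log_p2 - m)
    return e1 / (e1 + e2)
```

With these constants a single match at the 5-pixel limit of the search range gives a probability close to 0.5, so a barely found match carries almost no evidence.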

Initialisation and Convergence. The EM is initialised with a guess of the
two motions Θ. For the first frame, the initial guesses are zero motion (the
camera is likely to be stationary or tracking the foreground object) and the
mean motion, estimated from the initial errors of all the edges. For subsequent
frames, a velocity estimate is used (see Sec. 3.4). For the first iteration of EM,
the tracker search path is set at 20 pixels to compensate for a possible poor
initialisation.

Convergence is gauged by considering the Maximum A Posteriori labelling 
of each edge (either motion 1 or motion 2, depending on which is more likely). If
no edge changes labelling between two iterations then convergence is assumed. 
The maximum number of iterations is set at 40, which takes around 3 seconds 
on a 300MHz Pentium II. 

3.3 Labelling Regions R and Finding the Layer Order F

Having estimated the most likely motions Θ, the second term of (El can be
maximised. This finds the most likely region labelling and identifies the motion most
likely to be foreground. Using Q, this can be performed by hypothesising possible
region and foreground motion labellings and calculating their probabilities,
selecting the most probable.
Fig. 3. Markov chain transition probabilities for (a) the correct motion and (b) the
incorrect motion (axis: current tracker error). These are used to calculate the
probability of observing a particular sequence of tracking node errors along an edge.
A residual of -6 corresponds to no match being found at that tracker under that motion.



Region Probabilities from Edge Data. Given a hypothesised region labelling
and layer ordering, the edges can all be labelled as motion 1 or motion 2
by following the Labelling Rule from Sect. 2.1. The probability of this region
labelling given the data (term (jSk)) is given by the probability of the edges having
these labels.

The edge probabilities used in the EM of Sec. 3.2 made the assumption
that tracking node errors were independent. While this is acceptable for the 
EM, under this assumption the edge probabilities are too confident and can 
result in an incorrect region labelling solution. As a result, a more suitable edge 
probability model was developed. Correlations between tracking nodes along an 
edge can be decoupled using Markov chains, which encode 1st order probabilistic 
relationships. (Used, for example, by MacCormick and Blake H to make their 
contour matching more robust to occlusion.) These higher-order statistics cannot 
be used for the EM since they are only valid at (or near) convergence. However, 
to ensure that the EM solution maximises the Markov chain edge probabilities, 
the EM switches to the Markov chain model near convergence. 

The Markov chain models the relationship between one tracking node and the
next along an edge, giving the probability of a tracking node having a certain
residual $d_i$ given the residual at the previous tracking node. These transition
probabilities were estimated from data for the cases where an edge is matched
under the correct motion and under the incorrect motion, and the modelled
probabilities can be seen in Fig. 3. It is found that under the correct motion, a
low residual distance is likely, and the residuals are largely independent (unless
no match is found, in which case it is highly likely that the next tracking node
will also find no match). Under the incorrect motion, the residual distances are
highly correlated, and there is always a high probability that no match will be
found.
There are two models of edge tracking node sequence formation: either
motion 1 is correct and motion 2 incorrect, or vice versa. Both are considered
equally likely a priori. The chain probability is calculated from the product of
the transition probabilities, and it is assumed that the probabilities under the
correct and incorrect motions are independent and so the two can be multiplied
to give the hypothesis probability. Finally, the two hypothesis probabilities must
be normalised by their sum to give the posterior edge probability.
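
A minimal sketch of this two-hypothesis calculation is given below. It assumes the residuals have been quantised into a small number of states (one of them standing for "no match") and that each chain model is supplied as an initial distribution plus a transition matrix. The text only describes the transition probabilities, so the initial distribution, the quantisation and the function names are assumptions made for illustration.

```python
import numpy as np

def chain_log_prob(states, initial, transition):
    """Log-probability of a sequence of quantised residuals under a 1st-order chain."""
    logp = np.log(initial[states[0]])
    for prev, cur in zip(states[:-1], states[1:]):
        logp += np.log(transition[prev, cur])
    return logp

def edge_posterior_motion1(res1, res2, correct, incorrect):
    """Posterior probability that motion 1 is the correct motion for an edge.

    res1, res2         : residual state indices along the edge under motions 1 and 2
    correct, incorrect : (initial, transition) pairs for the two chain models
    """
    # Hypothesis 1: motion 1 correct, motion 2 incorrect (assumed independent).
    h1 = chain_log_prob(res1, *correct) + chain_log_prob(res2, *incorrect)
    # Hypothesis 2: the roles of the two motions are swapped.
    h2 = chain_log_prob(res1, *incorrect) + chain_log_prob(res2, *correct)
    # Equal priors cancel; normalise the two hypothesis probabilities by their sum.
    m = max(h1, h2)
    e1, e2 = np.exp(h1 - m), np.exp(h2 - m)
    return e1 / (e1 + e2)
```

Working in log space keeps the product of many transition probabilities from underflowing on long edges.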

The region probability given the data is the probability that all its edges 
obey the correct motions. It is assumed that the edges are independent, so (Et) 
can be evaluated by multiplying together all region edge probabilities under the 
edge labelling implied by R and F. 



Region Prior Term (0)) encodes the a priori region labelling. This is imple- 
mented using a Markov Random Field (MRF), where the prior probability of a 
region’s labelling depends on its immediate neighbours. Neighbours are considered
in terms of the fractional boundary length, such that the more of a region’s
boundary adjoins foreground regions, the more likely the region is to be
foreground.

The prior model was estimated from examples of correct region segmentations.
An asymmetric sigmoid is a good fit to the data, where it is more likely
to have a promontory of foreground in a sea of background than an inlet in the
foreground (f is the fraction of foreground boundary around the region):

$$P(R) = \frac{1}{1 + \exp\left(-10\,(f - 0.4)\right)}$$
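
The asymmetry of this prior is easy to check numerically. The sketch below assumes the sigmoid has the form 1 / (1 + exp(-10(f - 0.4))), as read from the printed expression, with f expressed as a fraction between 0 and 1; the function name is illustrative.

```python
import math

def foreground_prior(f):
    """Prior probability that a region is foreground.

    f is the fraction (0..1) of the region's boundary that adjoins regions
    currently labelled as foreground.
    """
    return 1.0 / (1.0 + math.exp(-10.0 * (f - 0.4)))

# A region entirely surrounded by background (f = 0) is still foreground with
# a probability of about 0.018, whereas a region entirely surrounded by
# foreground (f = 1) is background with a probability of only about 0.0025.
isolated_foreground = foreground_prior(0.0)
isolated_background = 1.0 - foreground_prior(1.0)
```

The break-even point lies at f = 0.4 rather than 0.5, which is what makes an isolated foreground region several times more likely than an isolated background one.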



Solution by Simulated Annealing. In order to maximise the probability over all
possible region labellings, simulated annealing (SA) is used. This begins with an initial
guess and then repeatedly tries flipping individual region labels one by one to 
see how the change affects the overall probability. (This is a simple process since 
a single region label change only causes local changes.) 

The annealing process is initialised with a guess based on the edge
probabilities. According to Sec. 2.1, foreground regions are entirely surrounded by
foreground edges. This can be used as a region-labelling rule, although it is found
that it works better if slightly diluted to allow for outliers. The initial region
labelling labels as foreground any region with more than 85% of its edges having
a high foreground probability.

Taking each region in turn,¹ it is considered both as foreground and as
background and the probability of each hypothesis is calculated. In each case, the
prior P(R) can be calculated by reference to the current labels of its neighbours
and the evidence calculated from the edge probabilities (using the edge motions
implied by the neighbouring region labels and the layer ordering).

¹ Each pass of the data labels each region, but the order is shuffled each time to avoid
systematic errors.
In the first pass, the region is then assigned by a Monte Carlo approach,
i.e. randomly according to the two probabilities. However, the cycle is repeated
and as the iterations progress, these probabilities are forced to saturate such
that after around 30 iterations, all regions are assigned to their most
likely motion. The annealing process continues until no changes are observed in
a complete pass of the data, which takes about 40 iterations.
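
The relabelling loop can be sketched as follows. This is an illustration under several assumptions and not the schedule actually used: the callback `fg_probability` is a hypothetical stand-in for the combination of MRF prior and edge evidence, the sharpening exponent `gamma` is an arbitrary way of forcing the probabilities to saturate, and the loop is only allowed to stop once saturation is complete.

```python
import random

def anneal_labels(regions, fg_probability, initial, saturate_after=30,
                  max_passes=100, seed=0):
    """Stochastic relabelling of regions as foreground (True) or background (False).

    fg_probability(region, labels) -> probability that `region` is foreground
    given the current labels of the other regions (hypothetical callback).
    """
    rng = random.Random(seed)
    labels = dict(initial)
    order = list(regions)
    for pass_index in range(max_passes):
        rng.shuffle(order)  # new visiting order each pass to avoid systematic errors
        changed = False
        for region in order:
            p = fg_probability(region, labels)
            if pass_index >= saturate_after:
                new_label = p >= 0.5  # saturated: always take the most likely label
            else:
                # Sharpen the probability as the passes progress (assumed schedule).
                gamma = 1.0 + pass_index
                a, b = p ** gamma, (1.0 - p) ** gamma
                new_label = rng.random() < a / (a + b)
            if new_label != labels[region]:
                labels[region] = new_label
                changed = True
        if pass_index >= saturate_after and not changed:
            break  # a complete pass without changes
    return labels
```

In the first pass `gamma` is 1, so labels are drawn directly from the two hypothesis probabilities, which corresponds to the Monte Carlo assignment described above.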

The random element in SA enables some local minima to be avoided. How- 
ever, it was found that local minima were still a problem under any reasonable 
cooling timetable and, under some situations, the optimal solution was only 
found around a third of the time. This is solved by repeating the annealing 
process a number of times: 10 maximisations are performed, which gives a 99% 
probability of finding the optimal solution. The entire maximisation of (0 takes 
around 2 seconds on a 300MHz Pentium II. 
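
As a rough consistency check on these figures, assume (this is not stated above) that the 10 annealing runs are independent and that each finds the optimal solution with the same probability p. The probability that at least one run succeeds is then

$$P_{\text{success}} = 1 - (1 - p)^{10} .$$

With p = 1/3 this gives $1 - (2/3)^{10} \approx 0.983$, and the quoted 99% is reached for p of about 0.37, so the two figures agree if "around a third" is read loosely.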



Determining Depth Ordering F and Optimal Segmentation R. Moving
between region and edge labels, as in the annealing process, requires the layer
ordering F to be known. This identifies the occluding edges of regions, and a
different layer ordering can result in a very different segmentation. The most
likely ordering, and segmentation, is the one which is most consistent with the
edge probabilities, i.e. the R, given F, with the highest probability.

The annealing process is thus performed twice, once for each possible value 
of F, first with motion 1 as foreground and then motion 2 as foreground. The 
segmentation with the greater posterior probability identifies the most likely 
foreground motion and the segmentation. 



3.4 Multiple Frames 

The maximisation outlined in Sects. 3.2 and 3.3 can be performed over only two
frames with good results (see, for an example, Fig. 1(c)). However, over multiple
frames more evidence can be accumulated to give a more robust estimate. It is 
always the segmentation of frame 1 that is being maximised, so after comparing 
frame 1 to frame 2, frames 1 and 3 are compared, and then 1 and 4 and so on. 



Initialisation. The estimated motions and edge probabilities between frames 1
and 2 can be used to initialise the EM stage for the next frame. The motion
estimate is that for the previous frame incremented by the velocity between the 
previous two frames. The edge labelling is initialised to be that implied by the 
region labelling of the previous frame, and the EM begins at the M-stage. 



Combining Statistics. The probability that an edge obeys motion 1 over n
frames is the probability that it obeyed motion 1 in each of the n frames. This 
can be calculated from the product of the probabilities for that edge over all 
n frames, if it is assumed that the edge probabilities are independent between 
frames. 
To perform the region and foreground labelling based on the cumulative
statistics, the method described in Sec. 3.3 is followed but using the cumulative
edge statistics rather than those from just one frame.



Occlusion. Over only two frames the problem of occlusion has been ignored
as it has little effect on the outcome. When tracking over multiple frames, this
becomes a significant problem. The foreground/background labelling for edges,
however, allows this problem to be overcome. For each edge labelled as
background according to the previous frame’s region labelling, the tracking nodes’
locations in the current image (under the background motion) are projected back
into frame 1 under the foreground motion. If they fall into regions currently
labelled as foreground, they are marked as occluded and they do not contribute
to the tracking for that edge. All trackers are also tested to see if they project
to outside the frame under the current motion and, if so, they are also ignored.



Segmenting a Sequence. The segmentation of an entire sequence may be
approximated by projecting the foreground regions into the other frames of the
sequence according to the foreground motion at each frame. These regions may
then be used as a ‘template’ to cut out the object in each of the subsequent
frames (see Figs. 5 and 6).

4 Results 

Figure 1 shows the segmentation from the standard ‘foreman’ sequence based
on two neighbouring frames. Between frames the head moves a few pixels to
the left. The first frame is statically segmented (Fig. 1(a)) and then EM is run
between this frame and the next to extract the motion estimates. Figure 1(b)
shows the edge labels based on how well they fit each motion after convergence.
It can be seen that the EM process picks out most of the edges correctly, even
though the motion is small. The edges on his shoulders are poorly labelled,
but this is due to the shoulders’ motion being even smaller than that of the
head. The correct motion is selected as foreground with very high confidence
(a posterior probability of about 99%) and the final segmentation, Fig. 1(c), is
very good despite some poor edge labels. In this case the MRF region prior is
a great help in producing a plausible segmentation. On a 300MHz Pentium II,
it takes around 7 seconds to produce the motion segmentation from an initial
static region segmentation.

The effect of using multiple frames can be seen in Fig. 4. Accumulating the
edge probabilities over several frames allows random errors to be removed and
edge probabilities to be reinforced. The larger motions between more widely
separated frames also remove ambiguity. It can be seen that over time the
consensus among many edges on the shoulders is towards the foreground motion.
The accumulated edge probabilities have a positive effect on the region
segmentation, which settles down after a few frames to a very accurate solution. If the
segmentation were continued over a large number of frames then the errors from
assuming affine motion would become significant (particularly as the
foreman tilts his head back and opens his mouth), and the segmentation would
break down. Dealing with non-affine motions is a significant element planned for
further work.

Fig. 4. Evolution of the ‘foreman’ segmentation, showing the edge probabilities and
segmentations of frames 47-52 as the evidence is accumulated. The edge probabilities
become more certain and small errors are removed, resulting in an improved region
segmentation.

Fig. 5. Multiple-frame segmentation of the ‘tennis’ sequence. The camera zooms out
while the arm slowly descends. Shown is the original frame 29 and then the foreground
segmentation of part of the sequence, showing every 5th frame. The final region
labelling is used to segment all frames in a second pass of the data.

Figures 5 and 6 show some frames from extended sequences segmented using
this method. In the ‘tennis’ sequence (Fig. 5) the arm again does not obey the
affine motion particularly well (and the upper arm and torso hardly obey it at
all), but is still tracked and segmented well over a short sequence of frames. The
‘car’ sequence, Fig. 6, is atypical: it has a large background motion (around 10
pixels per frame), a hole in the foreground object, and the dominant motion is
the foreground. However, it is still segmented very cleanly (including the
window) and the correct motion is identified as foreground. In this case the layer
ordering is rather unsure over 2 frames (70%/30%), but over many frames the
edge labellings are reinforced and the final decision is clearly in favour of the
correct labelling. The affine motion fits the side of the car well over a large number
of frames.
Fig. 6. Multiple-frame segmentation of the ‘car’ sequence. The camera pans to the left
to track the car. Shown is the original frame 490 and then the foreground segmentation 
of part of the sequence, showing every 5th frame. The final region labelling is used to 
segment all frames in a second pass of the data. 



5 Conclusions and Future Work 

This paper develops and demonstrates a novel Bayesian framework for segment- 
ing a video sequence into foreground and background regions based on tracking 
the edges of an initial region segmentation between frames. It is demonstrated 
that edges can be reliably tracked and labelled between frames of a sequence 
and are sufficient to label regions and the motion ordering. 

The EM algorithm is used to simultaneously estimate the two motions and 
the edge probabilities (which can be robustly estimated using a Markov chain 
along the edge). The correct foreground motion and region labelling can be 
identified by hypothesising and testing to maximise the probability of the model 
given the edge data and an MRF prior. The algorithm runs quickly and the results
are very good over two frames. Over multiple frames the edge probabilities can 
be accumulated resulting in a very accurate and robust region segmentation. 

The current implementation considers only two layers under affine motions. 
Future work will concentrate on extended multi-frame sequences, since over a 
longer sequence the edge motions cannot be well modelled by an affine mo- 
tion model. Disoccluded edges also appear, and should be incorporated into the 
model. Both problems may be solved by using the tracked edges to assist in the 
resegmentation of future frames in the sequence, which then behave as new ‘key 
frames’ for the segmentation process described in this paper. This would allow 
the system to adapt to non-rigid, non-affine motions over the longer term. 
Abstract. Segmentation of optical flow fields, estimated by spatio-temporally
adaptive methods, is - under favourable conditions - reliable
enough to track moving vehicles at intersections without using vehicle or
road models. Even a single image plane trajectory per lane obtained
in this manner offers valuable information about where lane markers
should be searched for. Fitting a hyperbola to an image plane trajectory
of a vehicle which crosses an intersection thus provides concise geometric
hints. These make it possible to separate images of direction indicators and
of stop marks painted onto the road surface from side marks delimiting a lane.
Such a ‘lane spine hyperbola’, moreover, makes it easier to link side marks
even across significant gaps in cluttered areas of a complex intersection.
Data-driven extraction of trajectory information thus makes it possible to link
local spatial descriptions practically across the entire field of view in order
to create global spatial descriptions. These results are important since
they allow the required information to be extracted from image sequences of
traffic scenes without the need to obtain a map of the road structure and to
make this information (interactively) available to a machine-vision-based
traffic surveillance system.

The approach is illustrated for different lanes with markings which are 
only a few pixels wide and thus difficult to detect reliably without the 
search area restriction provided by a lane spine hyperbola. So far, the
authors have not found comparable results in the literature.



1 Introduction 



Geometric results derived from (model-based) tracking of road vehicles in traffic 
image sequences can already be transformed into conceptual descriptions of road 
traffic. The generation of such descriptions from video recordings of road traffic 
at inner-city intersections - see, e. g., j'2l,3l4ltl ITj - presupposes, however, the 
availability of knowledge about the spatial lane structure of the intersection. 
In addition to knowledge about the geometric arrangement of lanes, knowledge 
about lane attributes is required such as, e. g., which lane might be reserved for 
left or right turning traffic - see Figure 1 for illustration.



D. Vernon (Ed.): ECCV 2000, LNCS 1843, pp. 411-g^ 2000. 
So far, this kind of knowledge had to be provided a-priori by the designer(s)
of a system, either by a qualitative interactive extraction from image sequence 
data or by digitizing a map of the intersection. Even if a city administration pro- 
vides a map of an intersection comprising all significant lane markings, however, 
experience has shown that such a map may not be up-to-date. 




Fig. 1. The left panel shows a representative frame from a traffic intersection video
sequence recording traffic from the incoming arm of road A (upper left quadrant)
through the intersection to the outgoing arm of road B at the bottom. The right panel
shows another frame from this same sequence, recorded while pedestrians were allowed
to walk and no vehicles happened to be in the field of view of the recording video camera.
Lane markings have been extracted from this frame.



This state of affairs naturally suggests an attempt to automatically extract 
the lane structure of an intersection from the video sequence recording the road 
traffic to be analyzed. The next section outlines our approach, followed by a 
more detailed description in Sections 3 and 4, illustrated by experimental results
obtained by an implementation of this approach. Additional experimental results 
are presented in Section 0 A concise overview of relevant publications is followed 
by a comparison with our approach and by conclusions in Section 0 

2 Basic Assumptions and Outline of the Approach 

Experience with data-driven image segmentation approaches - regardless of
whether they are edge- or region-oriented - has shown that success at an affordable
computational expense depends critically on the exploitation of appropriate implicit
knowledge about the depicted scene, including its illumination, and about the 
imaging conditions. We venture that progress results if such implicit knowledge 
can be explicated and thus made amenable to scrutiny, a precondition for further 
improvement. 
This contribution assumes that vehicles which approach, cross, and leave
an intersection stay predominantly within their lane. Image plane trajectories of 
vehicles thus provide crucial information about the number and geometry of lanes 
as well as about admitted driving directions. The image region corresponding to 
a lane can thus be conceived to be a ribbon of approximately constant width. 
The vehicle trajectory constitutes the ‘spine’ of such a ‘lane ribbon’. 

Lanes around an inner-city intersection are composed in general of three
segments: a straight ‘approaching’ arm, a straight-line or circular-arc segment across
the intersection proper, and a straight ‘leaving’ arm. As will be shown, these 
characteristics can be captured astonishingly well and concisely by one arm of a 
hyperbola. The assumed ‘hyperbolic spine’ of a lane thus constitutes a powerful 
cue for lane interpolation across the intersection area. 

Direction markers painted onto the road surface are expected near the spine 
of a lane ribbon and parallel to it whereas stopping lines cut nearly perpendicular 
across the lane ribbon. Sidemarks delimit the lane along the ribbon borders in 
the approaching and leaving arm segments, but are often omitted within the 
intersection area itself. All lane markers are assumed to be elongated bright 
blobs surrounded by dark road surface. 

These quite general - but nevertheless essentially appropriate - assumptions 
can be exploited to the extent that we succeed in extracting vehicle trajectories
in the image plane without introducing knowledge about position, orientation, 
shape, and motion of vehicles in the scene. Instead, we assume that vehicles 
can be represented as blobs with sufficient spatiotemporal contrast - i. e. not 
necessarily purely spatial contrast - to segment them from the remainder of the 
imaged scene. We assume, moreover, that such ‘object image candidates (OICs)’ 
move smoothly in the image plane. 

The transformation of these assumptions into an algorithm for the extraction 
of lane descriptions from intersection traffic image sequences will be treated in 
more detail in the next section. 

3 Extraction of Lane Spines from Image Sequences 

Automatic machine-vision-based traffic surveillance at an intersection avoids
several complications if a single camera records the relevant intersection area (no
tracking beyond the field of view of one camera into that of another, simpler
camera calibration, no inter-camera correspondence problems, etc.). The price
to be paid for this simplification consists in rather small image structures which 
have to be detected, tracked, and classified, since a large field of view at a given 
resolution results in small object images. The following discussion treats only 
the evaluation of monocular image sequences. 

3.1 Detecting and Tracking the Image of a Moving Road Vehicle 

If a vehicle trajectory is expected to provide the information about where in the
image one should search for lane markings, then this trajectory must be
extracted with a minimum of a-priori knowledge. Detection and tracking of moving
vehicles will be investigated by estimation and segmentation of densely
populated Optical Flow (OF) fields, despite the considerable computational expense
involved: compared with, e. g., change detection approaches, OF-field segments
provide immediately usable information about magnitude and direction of the
shift velocity of greyvalue structures. In many cases, such information allows
alternative interpretations of results to be excluded. If executed properly, moreover,
the estimation of OF-fields provides additional information about the reliability
of an estimation and, thereby, further supports the algorithmic analysis of any
result in doubt.

Fig. 2. The leftmost panel (a) shows an enlarged section around a vehicle to be
tracked, cropped from the 375th frame of the sequence illustrated in Figure 1. Panel
(b) shows the accepted OF-vectors overlaid on this enlarged section whereas the
OIC-mask generated for the first subsequence ending with this frame is shown as an overlay
in panel (c). Panel (d) shows OIC-mask contours obtained for this vehicle, overlaid on
the 375th frame of this sequence. The OIC-mask position has been advanced from frame
m to m+1 by the average optical flow vector $u_m$ obtained from the OF-blob determined
for this vehicle in frame m. The centers of these OIC-masks form a trajectory obtained
by a purely data-driven approach.

In view of the rather small image structures to be tracked, we adopted - albeit
in a modified manner - a recently published approach towards OF-estimation
m: optical flow is computed by determination of the eigenvector corresponding
to the smallest eigenvalue of the so-called ‘Greyvalue Local Structure Tensor’,
a weighted average (over a local environment of the current pixel)
of the outer product $\nabla g (\nabla g)^T$, where
$\nabla g = (\partial g/\partial x, \partial g/\partial y, \partial g/\partial t)^T$ denotes the gradient of
the greyvalue function $g(x, y, t)$ with respect to image plane coordinates $(x, y)$
and time $t$. Spatio-temporal adaptation of the filter masks for gradient
estimation improves the trade-off between noise reduction and separation of different
greyvalue structures.

A 4-connected region of OF-estimates is then selected as an ‘OF-blob’,
provided at each pixel position within such a region

1. the OF-magnitude exceeds a minimum threshold (separation between
stationary background and moving vehicles),

2. the smallest eigenvalue of $\nabla g (\nabla g)^T$ is smaller than a threshold (i. e. the
greyvalue structure remains essentially constant in the OF-direction), and

3. the two larger eigenvalues of $\nabla g (\nabla g)^T$ both exceed a minimum threshold
(i. e. the greyvalue variation in both spatial directions is sufficient to reliably
estimate an OF-vector).
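
The eigenvector construction and the three acceptance tests can be sketched for a single small spatio-temporal patch as follows. This is a simplified illustration, not the spatio-temporally adaptive estimator referred to above: gradients come from plain central differences, the tensor is an unweighted average over the patch, and all function names and threshold values are placeholders.

```python
import numpy as np

def patch_optical_flow(volume):
    """Estimate one OF-vector for a small spatio-temporal patch.

    volume : array of shape (T, H, W) holding greyvalues g(t, y, x).
    Returns (flow, eigenvalues): flow = (u, v) in pixels per frame and the
    eigenvalues of the averaged tensor in ascending order.
    """
    gt, gy, gx = np.gradient(volume.astype(float))
    # Discard the border, where np.gradient falls back to one-sided differences.
    inner = (slice(1, -1),) * 3
    grad = np.stack([gx[inner].ravel(), gy[inner].ravel(), gt[inner].ravel()])
    # Unweighted average of the outer products (grad g)(grad g)^T over the patch.
    tensor = grad @ grad.T / grad.shape[1]
    eigenvalues, eigenvectors = np.linalg.eigh(tensor)  # ascending order
    ex, ey, et = eigenvectors[:, 0]                     # smallest eigenvalue
    # The eigenvector is proportional to (u, v, 1); et near zero would signal
    # an unreliable estimate and is not handled in this sketch.
    flow = np.array([ex / et, ey / et])
    return flow, eigenvalues

def accept_estimate(flow, eigenvalues, min_speed=0.2, max_small=1e-3, min_large=1e-3):
    """The three acceptance tests listed above; all thresholds are placeholders."""
    moving = np.hypot(*flow) > min_speed
    constant_along_flow = eigenvalues[0] < max_small
    textured = eigenvalues[1] > min_large and eigenvalues[2] > min_large
    return bool(moving and constant_along_flow and textured)
```

For a pattern translating with constant velocity every gradient vector is orthogonal to (u, v, 1), which is why the eigenvector of the smallest eigenvalue yields the flow after division by its temporal component.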
OF-blobs which can be persistently detected at mutually (according to the
OF-estimate) compatible image plane locations are aggregated into an ‘Object
Image Candidate (OIC)’ as follows. We assume that the shape of a vehicle’s
image does not change significantly during a subsequence of n (≈ 12) frames.
Let $u_k$ denote the average OF-vector of the k-th OF-blob within a subsequence
(k = 1, 2, ..., n − 1). The k-th OF-blob is shifted forward by $\sum_{j=k}^{n-1} u_j$ (with
components of the resulting sum vector rounded to the next integer value) and
‘stacked’ on top of the OF-blob extracted from the last frame within this
subsequence. Among the pixel locations supporting this stack, we retain as an initial
OIC-mask only those which are covered by at least p % (with, e. g., p = 45)
of the n possible entries from the stacked OF-blobs. The OIC-contour is then
taken as a rough estimate for the image shape of a moving vehicle.
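
The stacking step may be sketched as follows, assuming each OF-blob is given as a boolean image mask and each average OF-vector as a (dx, dy) pair in pixels per frame. The function names and the handling of pixels shifted outside the image (they are dropped) are illustrative assumptions.

```python
import numpy as np

def shift_mask(mask, dx, dy):
    """Shift a boolean mask by integer (dx, dy); pixels leaving the image are dropped."""
    shifted = np.zeros_like(mask, dtype=bool)
    ys, xs = np.nonzero(mask)
    ys, xs = ys + dy, xs + dx
    keep = (ys >= 0) & (ys < mask.shape[0]) & (xs >= 0) & (xs < mask.shape[1])
    shifted[ys[keep], xs[keep]] = True
    return shifted

def initial_oic_mask(blobs, flows, p=45.0):
    """Stack the OF-blobs of one subsequence on top of the blob of its last frame.

    blobs : list of n boolean masks, one per frame of the subsequence
    flows : list of n-1 average OF-vectors (dx, dy); flows[k] moves blobs[k]
            to the frame of blobs[k+1]
    p     : minimum coverage, in percent of the n possible entries
    """
    n = len(blobs)
    hits = blobs[-1].astype(int)
    for k in range(n - 1):
        # Accumulated displacement from frame k to the last frame, then rounded.
        total = np.sum(np.asarray(flows[k:], dtype=float), axis=0)
        dx, dy = (int(c) for c in np.rint(total))
        hits += shift_mask(blobs[k], dx, dy)
    return hits >= (p / 100.0) * n
```

Pixels that appear in only a few of the n blobs, for instance isolated false detections, fall below the coverage threshold and are removed from the mask.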

For the first subsequence, the initial OIC-mask is accepted in this form. In the case
of later subsequences, however, consecutive OIC-masks are merged in order to
adapt an OIC-mask to possible changes in appearance of an object image. An
OIC-mask $OIC_{i-1}$ obtained from the (i−1)-th subsequence is shifted to the
location of the initial OIC-mask obtained from the next (i-th) subsequence.
The two OIC-masks are stacked, each with the weight (i. e. hit count) determined
at its generation. Analogously to the generation of an initial OIC-mask, all pixel
locations are retained which received a weight of at least p % of the 2n possible
entries from the two stacked OIC-masks.

Figure 2 illustrates the generation of an OIC-mask.



3.2 Extraction of the ‘Lane Spine’ 




Fig. 3. Left panel: The OIC-mask contours, shown every 20 half-frames, with small
dark ‘x’ denoting the center of an OIC-mask. Right panel: The ‘lane spine’, a hyperbola
fitted according to equ. (1) - using every half-frame - to the OIC-mask trajectory shown
in the left panel.
As mentioned earlier, image plane trajectories of vehicles crossing an 
intersection can in good approximation be described by a hyperbola. In analogy 
with established practice for ellipses, we fit the following polynomial 

$$A x^2 + B x y + C y^2 + D x + E y + F = 0 \qquad (1)$$

to a sequence of OIC-mask centers such as shown in Figure 3. The free parameter 
in this homogeneous equation is fixed by the requirement A - C = 1, in analogy 
to the well-known trace requirement A + C = 1 for ellipse fitting. The 
degenerate case of a straight-line trajectory is automatically detected and 
treated separately. The axes of the right-handed eigenvector coordinate system 
of the resulting conic are fixed by requiring that the apex of the hyperbola 
lies on the first axis and that the estimated vehicle trajectory opens into the 
positive direction of this axis. 
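
To illustrate how the constraint A - C = 1 turns the conic fit into an ordinary linear least-squares problem, the following Python sketch substitutes A = C + 1 into the polynomial and solves the normal equations for the remaining five coefficients. It is a sketch under stated assumptions rather than the procedure actually used: the function names are hypothetical, the solver is plain Gaussian elimination, and the degenerate straight-line case is only guarded by a crude singularity check.

```python
import math


def solve_linear(matrix, rhs):
    """Solve matrix * z = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular system (e.g. points on a straight line)")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            f = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= f * aug[col][c]
    z = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(aug[i][j] * z[j] for j in range(i + 1, n))
        z[i] = (aug[i][n] - s) / aug[i][i]
    return z


def fit_conic_a_minus_c(points):
    """Least-squares conic through points under A - C = 1; returns (A, B, C, D, E, F)."""
    # With A = C + 1 the conic reads B*xy + C*(x^2 + y^2) + D*x + E*y + F = -x^2,
    # which is linear in the unknowns (B, C, D, E, F).
    rows = [[x * y, x * x + y * y, x, y, 1.0] for x, y in points]
    targets = [-x * x for x, _ in points]
    k = 5
    normal = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    rhs = [sum(r[i] * t for r, t in zip(rows, targets)) for i in range(k)]
    B, C, D, E, F = solve_linear(normal, rhs)
    return (C + 1.0, B, C, D, E, F)


def is_hyperbola(coeffs):
    """A conic is a hyperbola if its discriminant B^2 - 4AC is positive."""
    A, B, C = coeffs[0], coeffs[1], coeffs[2]
    return B * B - 4.0 * A * C > 0.0


# Synthetic check: points on x^2/4 - y^2 = 1, i.e. 0.2*x^2 - 0.8*y^2 - 0.8 = 0.
samples = [(2.0 * math.cosh(t), math.sinh(t)) for t in (-1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5)]
coeffs = fit_conic_a_minus_c(samples)
```

For noisy trajectory data a numerically more robust solver (for instance a QR-based least-squares routine) would be preferable to explicit normal equations; they are used here only to keep the sketch dependency-free.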

4 Extraction of Lane Structures from Image Sequences 

In innercity intersections such as those shown in Figure ^ lanes are in general 
delimited at the sides by bright lines, by sidewalks with a visible curb, or by a 
surface with color or texture visibly different from the lane surface. We generally 
expect, therefore, to find edge elements marking the side of a lane image. The 
challenge is to reliably detect a sufficiently large fraction of these edge 
elements so that they can be concatenated into a coherent lane delimiter. At 
this point, we introduce a-priori knowledge about (continuous or interrupted) 
lane side boundary markings in the form of a ‘lane model’. In principle, one 
could define such a model in the (2-D) image plane or in the (3-D) scene. We 
decided to define the model in the scene, because the dependencies between the 
‘lane spine’ and the hyperbolas representing the lateral lane boundaries are 
more simply described in the scene. If we know the projective 
transformation by the camera system, a lane representation in the scene domain 
can be transformed without any further heuristics into the image plane. Our 
lane model will be denoted as a ‘hyperbolic ribbon’. This hyperbolic ribbon will 
be fitted to the edge elements tentatively selected as marking the side boundary 
of a lane image. 



4.1 Hyperbolic Ribbon as Lane Model 

As mentioned in Section 3, the spine of curved intersection lanes can in good 
approximation be described by a hyperbola. Based on such a ‘hyperbolic spine’, 
we are able to construct a ribbon of hyperbolas to describe the lateral delimiters 
of a curved lane. A lane has one delimiter at the right and one at the left side, 
so it seems sufficient to construct a ribbon of three hyperbolas. One hyperbola 
can be defined by five parameters $[m_x, m_y, \theta, a, b]$. The parameters $m_x$ and $m_y$ 
represent the center point, $\theta$ describes the orientation, and a, b specify the form 
of a hyperbola. A ribbon of three - nominally unrelated - hyperbolas therefore 
requires the specification of 15 parameters. 

Fig. 4. The left panel depicts the ‘lane spine’ G as shown in the right panel of Figure 3 
together with the two ‘lane boundary’ hyperbolas H, H'. These three hyperbolas 
represent a lane as a ‘hyperbolic ribbon’. The ‘leaving’ arm is wider than the ‘approaching’ 
arm of the lane, but the asymptotes of all hyperbolas are parallel. The coordinate 
system shown refers to the lane spine G, i.e., $m_{Gx} = m_{Gy} = 0$ and $\theta_G = 0$. The 
right panel illustrates how to construct one side of a hyperbolic ribbon with different 
widths. The center point $m_{H'}$ of the hyperbola H' can be constructed with the 
normalized directions of the asymptotes $T_1$ and $T_2$, the angle $\alpha = \pi - 2\eta$, and a scale factor 
based on the widths $d_1$, $d_2$. The direction and orientation of the hyperbola H' are the 
same, respectively, as those of the lane spine G, i.e., the normalized directions of the 
asymptotes $T'_1$, $T'_2$ are the same as the normalized directions of $T_1$, $T_2$. The position of 
the apex $s_{H'}$ is defined by the distance between the apices $s_G$ and $s_{H'}$. 

The outer hyperbolas can be derived from the hyperbolic spine inside the 
ribbon with a few basic and useful assumptions about the symmetry of a lane. 
These assumptions, based on official guidelines for constructing innercity 
intersections, lead to a large reduction of the number of parameters needed to 
define a hyperbolic ribbon (see Figure 4). All of the following definitions are 
related to the eigensystem of the lane spine G, which is shifted by $(m_x, m_y)^T$ 
and rotated by $\theta$ relative to the coordinate system of the scene, according to 
the parameters $[m_x, m_y, \theta, a, b]$ of lane spine G: 

— Orientation: All hyperbolas have the same orientation $\theta$. This reduces the 
number of parameters from 15 by two to 13. 

— Shape: Each hyperbola has a pair of asymptotes. These asymptotes 
represent the straight parts of a lane. We assume, therefore, that all aperture 
angles have the same value $\eta$ with $\tan(\eta) = b/a$. This reduces the number of 
parameters by two more to 11. 

— Position: The ‘approaching’ arm and the ‘leaving’ arm of a lane at an 
innercity intersection can differ in their width. The parameters $d_1$ and $d_2$ take 
this observation into account. Figure 4 shows how the center point of the 
hyperbola H' is moved. It can be computed from the center point of G as 
follows: 

$$m_{H'} = \begin{pmatrix} (d_1 + d_2)/(2\sin(\eta)) \\ (d_1 - d_2)/(2\cos(\eta)) \end{pmatrix} \qquad (2)$$

The position of the center point of the hyperbola H can be computed 
analogously: 

$$m_H = -\begin{pmatrix} (d_1 + d_2)/(2\sin(\eta)) \\ (d_1 - d_2)/(2\cos(\eta)) \end{pmatrix} \qquad (3)$$

Equations 2 and 3 can be derived based on the geometric context shown in 
the right panel of Figure 4. This reduces the number of parameters further 
by two parameters to 9. 

— Apex: The last free parameter defines the position of the apex. This position 
also specifies the acuity of the hyperbola at this point, because the center 
point, orientation and aperture angle are fixed. A hyperbola G in the normal 
form 

$$\frac{x^2}{a^2} - \frac{y^2}{b^2} = 1 \qquad (4)$$

has its apex at $(a, 0)^T$. Multiplying this vector by a factor $\rho$ moves the apex 
along the first axis: $(\rho \cdot a, 0)^T$. At the same time, the hyperbolas have to retain 
their shape. This can be achieved by multiplying the shape parameter b with 
the same factor $\rho$. The aperture angle $\eta$, where $\tan(\eta) = (\rho \cdot b)/(\rho \cdot a) = b/a$, 
will thus remain unchanged. The distance between the apices $s_{H'}$ and $s_G$ 
as well as between $s_H$ and $s_G$ is defined by 



$$\|s_{H'} - s_G\| = \|s_H - s_G\| = \tau \, \frac{d_1 + d_2}{2} \qquad (5)$$



Usually the width of the curved section of a lane is greater than in the straight 
lane sections. This can be taken into account by setting $\tau$ to values greater 
than 1. The factors $\rho_H$ and $\rho_{H'}$, referring to the lane boundary hyperbolas 
H and H', respectively, can be computed by solving: 



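$$\rho_H \cdot a = a + \frac{d_1 + d_2}{2\sin(\eta)} - \sqrt{\left(\tau\,\frac{d_1 + d_2}{2}\right)^2 - \left(\frac{d_2 - d_1}{2\cos(\eta)}\right)^2} \qquad (6)$$

$$\rho_{H'} \cdot a = a - \frac{d_1 + d_2}{2\sin(\eta)} + \sqrt{\left(\tau\,\frac{d_1 + d_2}{2}\right)^2 - \left(\frac{d_2 - d_1}{2\cos(\eta)}\right)^2} \qquad (7)$$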



Using these assumptions, the initial number of 15 parameters can be reduced to 
only 8 parameters for describing a complete hyperbolic ribbon: 

$$\langle (m_x, m_y)^T_G, \theta_G, a_G, b_G, d_1, d_2, \tau \rangle$$

where the index G refers to the lane spine G. 

This model approximates a lateral lane delimiter by an infinitely thin line. 
In order to take the finite width of a lane delimiter into account, we extend the 
model: we use a hyperbolic ribbon for each lane delimiter, this time 
parameterized for the width $d_w$ of the lane delimiter. The complete lane model can 
thus be described by the tuple 
$\langle (m_x, m_y)^T_G, \theta_G, a_G, b_G, d_1, d_2, \tau, d_{w1}, d_{w2} \rangle$ of 
parameters. The parameters $d_{w1}$, $d_{w2}$ represent the width of the left and right 
lane boundary delimiter, respectively. Two separate parameters are necessary in 
general since lane delimiters between adjacent lanes may differ in width from 
those delimiting the road sides. 
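
The center displacement of Equations 2 and 3 can be checked numerically. The Python sketch below computes the centers of H' and H in the eigensystem of the lane spine and then measures how far the asymptotes through the center of H' lie from the corresponding asymptotes of G. Under the assumption that the asymptote directions are (cos η, ±sin η), one asymptote turns out to be displaced by d_1 and the other by d_2. All function names and the numerical values are hypothetical and serve only as an illustration.

```python
import math


def boundary_centers(d1, d2, eta):
    """Centers of H' and H in the eigensystem of the lane spine G (Eqs. 2 and 3)."""
    mx = (d1 + d2) / (2.0 * math.sin(eta))
    my = (d1 - d2) / (2.0 * math.cos(eta))
    return (mx, my), (-mx, -my)


def asymptote_offsets(center, eta):
    """Signed perpendicular offsets of the asymptotes through `center`
    from the parallel asymptotes through the origin."""
    mx, my = center
    # Asymptote directions are (cos eta, +sin eta) and (cos eta, -sin eta);
    # projecting the center onto their unit normals gives the signed offsets.
    off_plus = -mx * math.sin(eta) + my * math.cos(eta)
    off_minus = mx * math.sin(eta) + my * math.cos(eta)
    return off_plus, off_minus


eta = math.atan2(3.0, 4.0)  # hypothetical aperture angle with tan(eta) = b/a = 3/4
center_h_prime, center_h = boundary_centers(d1=2.85, d2=3.50, eta=eta)
offsets = asymptote_offsets(center_h_prime, eta)  # magnitudes d2 and d1
```

The check confirms that a single center shift suffices to give the two arms different offsets while keeping all asymptotes parallel, which is what allows the ribbon to be described with only two additional parameters.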

4.2 Fitting Hyperbolic Ribbons to Edge Elements 




Fig. 5. The hyperbolas $H_1$ and $H_2$ model one lateral boundary of a lane. All edge 
elements inside the hyperbolic tolerance ribbon limited by the hyperbolas $B_l$ and 
$B_r$ are candidates for the fitting step. An edge element has a position and a direction, so 
we can interpret an edge element as a line. As shown in the lower part of this figure, 
the intersection of this line with the hyperbolas $B_l$ and $B_r$ results in the parameters 
$\lambda_l$ and $\lambda_r$. If $\lambda_l < 0 < \lambda_r$, the edge element inside the hyperbolic ribbon limited by $B_l$ 
and $B_r$ will be fitted to the hyperbola $H_1$, otherwise to $H_2$. The edge element shown 
at position $(u, v)^T$ in the upper part of this figure has the distance $d_{H_1}$ from $H_1$. This 
distance measure takes into account the gradient orientation at the point $(u, v)^T$. If 
we consider an edge element as a line, this distance can be derived by intersecting this 
line with the hyperbola $H_1$. 



Fitting hyperbolic ribbons to edge elements is based on the approach of m- 
Although the lane model is defined in the scene, the projected image of this model 
is fitted to the edge elements in the image: the sum of Mahalanobis distances 
between edge elements and the projected model (parameterized according to the 
current parameter estimates, which constitute the components of the filter state 
vector) is iteratively optimized. The finally accepted state vector represents 
the matched lane in the scene. 

The estimation process is initialized by the lane spine mentioned above. The 
widths of the arms of a lane are initially set to the same standard value. The 
typical lane width, the typical width at the curved section of a lane, and the 
typical width of the lateral lane boundary delimiters can be found in guidelines 
for constructing innercity intersections. All parameters can thus be initialized 
by reasonable values. 

The distance function takes the gradient orientation of an edge element into 
account (see Figure 5). Let an edge element $e = (u, v, \phi)^T$ and the hyperbola H 
be given by 

$$H: \; A x^2 + B x y + C y^2 + D x + E y + F = 0 \qquad (9)$$

The signed distance function $d_H(e, x)$, which quantifies the distance between an 
edge element e and the hyperbola H under the current state x (with 
$x = (m_{xG}, m_{yG}, \theta_G, a_G, b_G, d_1, d_2, \tau, d_{w1}, d_{w2})^T$), is defined by: 

$$d_H(e, x) = \begin{cases} \lambda_1, & |\lambda_1| < |\lambda_2| \\ \lambda_2, & \text{otherwise} \end{cases}$$

with 

$$\lambda_1 = \frac{-p/2 + \sqrt{p^2/4 - q r}}{r}, \qquad \lambda_2 = \frac{-p/2 - \sqrt{p^2/4 - q r}}{r},$$

$$r = A \cos^2(\phi) + B \cos(\phi) \sin(\phi) + C \sin^2(\phi),$$

$$p = 2 A \cos(\phi)\, u + B (\cos(\phi)\, v + \sin(\phi)\, u) + 2 C \sin(\phi)\, v + D \cos(\phi) + E \sin(\phi),$$

$$q = A u^2 + B u v + C v^2 + D u + E v + F.$$



Note that a negative distance is possible. 
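
The distance definition above amounts to intersecting the line through (u, v) with direction φ with the conic and keeping the root of smaller magnitude. A minimal Python sketch of this computation follows. The function name `signed_distance` is hypothetical, and the handling of the two special cases (no intersection, direction parallel to an asymptote) is an assumption of this sketch, since the text does not discuss them.

```python
import math


def signed_distance(conic, edge):
    """Signed distance from an edge element to a conic along the edge's direction.

    conic: (A, B, C, D, E, F) of A x^2 + B xy + C y^2 + D x + E y + F = 0.
    edge:  (u, v, phi) with position (u, v) and direction angle phi in radians.
    Returns None if the line through the edge element misses the conic.
    """
    A, B, C, D, E, F = conic
    u, v, phi = edge
    c, s = math.cos(phi), math.sin(phi)
    r = A * c * c + B * c * s + C * s * s
    p = 2 * A * c * u + B * (c * v + s * u) + 2 * C * s * v + D * c + E * s
    q = A * u * u + B * u * v + C * v * v + D * u + E * v + F
    if abs(r) < 1e-12:
        # Direction parallel to an asymptote: the quadratic degenerates
        # to a linear equation (case not covered in the text).
        return -q / p if abs(p) > 1e-12 else None
    disc = p * p / 4.0 - q * r
    if disc < 0.0:
        return None
    root = math.sqrt(disc)
    lam1 = (-p / 2.0 + root) / r
    lam2 = (-p / 2.0 - root) / r
    return lam1 if abs(lam1) < abs(lam2) else lam2


# Unit hyperbola x^2 - y^2 = 1; edge element at (2, 0) pointing along +x.
unit_hyperbola = (1.0, 0.0, -1.0, 0.0, 0.0, -1.0)
d = signed_distance(unit_hyperbola, (2.0, 0.0, 0.0))
```

In this example the nearest intersection is the apex at (1, 0), which lies one unit behind the edge element with respect to its direction, so the distance comes out negative.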

The lane is modeled with two lane delimiters. Each of them consists of two 
outer hyperbolas $H_1$, $H_2$ (the hyperbolic spine is only used for the construction 
of the hyperbolic ribbon, but not for the computation of the distance between 
edge elements and the projection of the lane model). Fitting all edge elements of 
the entire image to a lane model would consume too much time. It is necessary, 
therefore, to decide which edge element should be associated with which hyperbola. 
As a first constraint, a hyperbolic tolerance ribbon is built around 
each of the delimiters (see Figure 5). All edge elements inside such a hyperbolic 
tolerance ribbon between the hyperbolas $B_l$ and $B_r$ are considered candidates 
for the fitting process. The association of an edge element with the correct 
hyperbola ($H_1$ or $H_2$) is based on the typical greyvalue distribution near a lane 
delimiter, since delimiters are in general brighter than the rest of the lane. An 
edge element represents this property by the gradient direction at this position: 
this direction has to be approximately perpendicular to the assigned hyperbola. 
It is easy to decide whether an edge element has to be associated with the left or 
the right hyperbola of the delimiter: an edge element considered as a directed 
line can be intersected with the boundaries of the hyperbolic tolerance ribbon. 
Let $\lambda_l = d_{B_l}(e, x)$ and $\lambda_r = d_{B_r}(e, x)$ denote the distances of the edge element 
to the intersections of this line with the hyperbolas $B_l$ and $B_r$. If $\lambda_l < 0 < \lambda_r$, 
the edge element will be associated with hyperbola $H_1$, otherwise with hyperbola 
$H_2$ (see Figure 5). 



5 Experimental Results 

Figure 6 shows two successfully detected lanes with fitted models overlaid. The 
fitting process describes the lane delimiters essentially correctly. A small 
discrepancy in the curved section of the lane shown in the left panel corresponds to 
the variation of the lane width as described above. 

Both lanes differ in width between their approaching (2.85 m) and their 
leaving (3.50 m) arms. In addition, the width at the stop line is slightly smaller 
(2.75 m) than in the incoming straight sections of both arms. The lane spine 
derived for the right lane was a bit off the lane center since the driver of the 
respective car did not drive strictly along the lane center. The starting state 
for the right boundary delimiter was thus initialized too far to the right. The 
fitting process could not fully correct this state initialization. A possible solution 
for such a problem could consist in exploiting a longer observation of the scene: 
rather than relying on the first trajectory in order to derive a lane spine, one 
could select a more appropriate one, based on a statistical analysis of several 
trajectories. 

The left panel of Figure 7 shows all hyperbolas used in fitting a lane. The 
enlarged sections allow the quality of fit to be verified in more detail. 

The approach has been extended in order to treat neighboring lanes with 
essentially the same parameter set, as illustrated by the right panel of Figure 7: 
assuming equal widths and orientation for both lanes and symmetrical apex 
displacements for boundary spines, we only need one additional parameter, namely 
for the width of the common boundary delimiter separating the two neighboring 
lanes. In this manner, we are able to fit six hyperbolas - which can differ 
substantially with respect to their position and orientation in the image plane - to 
about 4900 edge elements, using only 11 parameters. A backprojection of lane 
delimiters extracted in this manner into the road plane and a comparison with 
official map information about lane markings at the depicted intersection yields 
good agreement (see Figure 8). 

The approach reported up to here has been extended even further in order to 
cope with situations such as those illustrated by the upper left panel in Figure 9: 
since this sequence comprises only fifty frames, no vehicle trajectory covers a 
lane across the entire intersection. We thus fit multiple trajectory segments (see 
the top right panel in Figure 9) from different vehicles in neighboring lanes to 
a jointly parameterized pair of two neighboring ‘lane spine hyperbolas’. The 
initial estimates for these lane spine hyperbolas are shown in the lower left panel 
and the final result for the boundary limits in the lower right panel of Figure 9. 
The remaining deficiency regarding the ‘leaving’ arm is due to the lack of an 
appropriate initialization and of sufficient contrast to facilitate a correction in 
the estimation of lane boundary delimiters. This figure illustrates both the 
power and current limits of our approach. 

Fig. 6. The left panel shows a fit of a hyperbolic ribbon to the left lane, computed 
according to the approach described in Section 4. The hyperbolic spine is drawn in blue, 
the hyperbolas in the centre of the lane delimiters are painted red, while the hyperbolas 
which model the boundaries of the lane delimiters are painted yellow. All edge elements 
which are taken into account during the fitting process are painted in blue, too. Within 
the upper enlarged subwindow, it can be clearly seen that edge elements due to the 
large traffic sign which occludes part of this lane do not endanger an appropriate fit. 
The right panel shows a fit of a hyperbolic ribbon to the right lane, in analogy to the 
left panel. 

Fig. 7. The left panel shows all hyperbolas exploited for the fit to the left lane, 
together with an enlarged section. The black hyperbolas bound the area in which edge 
elements are supposed to belong to the enclosed lane delimiter. One can clearly detect 
the differently oriented edge elements (colored in either pink or green) on the sides 
of bright blobs corresponding to short lane markings. (Notice the effect of grey value 
overshoot in the scanline direction for a transition between the bright lane markings 
and the dark background, which results in unexpected additional edge elements colored 
in green, clearly visible in the insets!) The right panel shows a simultaneous fit of two 
hyperbolic lane models side-by-side (using a joint parameter set, thus significantly 
reducing the number of parameters required to describe both lanes) to two neighboring 
lanes. 

Fig. 8. This image shows the backprojections of two simultaneously fitted lanes (see 
Figure 7) into a map of the intersection provided by the official land registry. The 
comparatively small deviations in the lower part of the image are attributed to small 
systematic errors of the camera calibration. As Figure 7 shows, the lanes were fitted 
well in the image plane both for the incoming and the leaving arm. It can be seen that 
the incoming lanes become narrower close to the stopping line in order to allow for an 
additional ‘bicycle lane’ for those bicyclists who want to continue straight ahead. 

6 Comparison and Conclusions 

The idea to exploit data-driven tracking of moving objects in video sequences 
in order to derive descriptions of developments in the scene has found 
increasing interest recently, due to methodological improvements in the detection and 
tracking of moving objects. The continuous decrease of the price/performance 
ratio for computers has significantly accelerated research in this area. Recent 
investigations have concentrated on tracking moving bodies in order to extract 
significant temporal events or, in addition, to build coarse scene models, 
but based on multiple calibrated cameras [7]. Others use data-driven 
tracking in order to recover 3D trajectories under the ‘ground plane constraint’ 
and exploit the knowledge acquired thereby to control an active camera platform 
in order to keep a moving body within the field of view of the camera set-up. 

Fig. 9. The upper left panel (a) shows an overview of another traffic intersection. In 
the upper right panel (b), object contours of all vehicles were overlaid. This variation 
of the basic method was used due to the shortness (100 half-frames) of the video 
sequence. The starting state of the fitting process based on the object contour centroids 
is shown in panel (c). Despite the short sequence, the model of two lanes could be 
fitted successfully to the visible lane delimiters (see panel (d)). The rather poor result in 
the upper left quadrant of the lower right panel (d) is due to missing tracking data, 
and thus missing object contour centroids, in this area. 

Lane finding is an important subtask for machine-vision-based control of road 
vehicles. In such a context, execution time is at a premium, with the effect that 
preference is given to computationally simple algorithms. As long as a vehicle 
guided by machine vision just has to follow an (at most slowly curving) road 
or to continue straight ahead across an intersection, lane boundaries can be 
approximated sufficiently well by low-order polynomials of the horizontal image 
coordinate, as has been studied in earlier work. Similar simple road 
configurations have also been investigated for lane keeping purposes by machine vision. 
Although another investigation addresses this same problem of lane keeping, its 
author analyzed the image of a slowly curving lane recorded by a video camera 
mounted behind the windshield of a driving car and concluded that it can be well 
approximated in the image plane as a hyperbola - essentially due to the effect of 
perspective projection under the conditions mentioned. In our case, the lane spine 
is modeled by a hyperbola in the scene: the straight line sections enclosing the 
curved section are due to the lane structure across an intersection in the scene - 
as opposed to being due to a perspective effect in the image plane associated with 
a constant curvature lane in the scene. Since we exploit the hyperbolic lane 
structure in the scene in order to reduce the number of parameters to be estimated 
for lane boundary delimiters and neighboring lanes, our approach turns out to be 
able to cope successfully with a considerable number of potentially distracting edge 
elements. 

So far, we have not encountered an example where a vehicle trajectory has been 
used to extract a global quantitative description of multiple lane borders in an 
image sequence of a nontrivial traffic scene. 

We exploit image-plane trajectories of vehicles in order to collect evidence in 
the image plane about the exact position of side marks in the form of edge 
segments in very restricted image regions, thereby significantly reducing the danger 
of picking up unwanted edge elements or edge segments. Fitting a hyperbola 
to vehicle trajectories enables us to interpolate the (frequently ‘invisible’) part 
of the lane within an intersection area. Local spatial descriptors can thereby be 
linked along the vehicle trajectory into the remainder of the field of view, thus 
establishing global spatial descriptors. An added attraction of our approach is 
seen in the fact that the transformation of hyperbolic curves under perspective 
projection can be studied in closed form. This should facilitate investigations 
regarding a quantitative transfer of spatial descriptions within an image into 
spatial scene descriptions. 

Future research will not only be devoted to increasing the robustness of the 
approach reported here, but also to developing estimation procedures for 
initialization parameters which have so far been set interactively, such as the initial width 
of a lane. The accumulation of information about lanes from multiple vehicle 
trajectories certainly belongs in this category, too, picking up ideas reported 
in earlier work. 

Robust automatic extraction of lane boundaries should facilitate the 
detection and classification of additional road markers painted onto a lane. Such a 
capability makes it possible to determine the ‘semantics’ of such a lane, for example that 
it constitutes a lane reserved for left turning traffic - for example, see Figure 0 
This information will significantly simplify the characterization of traffic behavior at the 
conceptual or even natural language level of description. 
Abstract. Tracking and characterizing convective clouds from meteorological 
satellite images makes it possible to evaluate the potential occurrence of 
strong precipitation. We propose an original two-step tracking method 
based on the Level Set approach which can efficiently cope with the 
frequent splitting or merging phases undergone by such highly deformable 
structures. The first step exploits a 2D motion field, and acts as a 
prediction step. The second step can produce, by comparing local and global 
photometric information, appropriate expansion or contraction forces on 
the evolving contours to accurately locate the cloud cells of interest. 
The characterization of the tracked clouds relies on both 2D local 
motion divergence information and temporal variations of temperature. It 
is formulated as a contextual statistical labeling problem involving three 
classes: “growing activity”, “declining activity” and “inactivity”. 



1 Introduction 

The study of the life cycle of strong convective clouds (CC) is an important 
issue in the meteorological field. Indeed, such cold clouds often bring severe 
weather such as heavy rain or even tornadoes. We aim at providing 
forecasters with new and efficient image processing tools in that context. We 
have addressed two major issues: tracking of cold cloud cells and characterization 
of their convective activity. To this end, we have developed two original methods 
exploiting both motion and photometric information. These methods can also 
be of interest beyond the meteorological domain. 

Preliminary studies in the meteorological domain have already considered 
these issues. Since these meteorological phenomena are present in cold cloud 
systems, detection of strong convective cloud cells first involves a low-temperature 
thresholding step in infrared images. In these studies, relevant cells are then isolated 
according to spatial properties (ellipticity factor, distribution of the spatial 
gradient of temperature, minimal area, ...), and tracking simply results from the 
overlap between the prediction of the position of a cell detected at time t - 1, 
using the previously estimated displacement of its center of gravity, and a cell 
extracted at time t. 

D. Vernon (Ed.): ECCV 2000, LNCS 1843, pp. 428-El 2000. 
A commonly adopted strategy in computer vision to extract and to track 
complex objects in an image sequence is to exploit deformable models such as 
active contours [4,9]. Starting from an initial position, and using external 
forces and internal constraints, the contour shape is modified toward the desired 
solution. However, results are highly sensitive to the initial conditions, and the 
considered scheme usually cannot handle shapes with significant 
protrusions. Moreover, topological transformations of the silhouette shape, such as 
merging and splitting of parts, cannot be properly coped with. Different extensions to 
the active contour techniques have been developed to alleviate these drawbacks, 
such as introducing particle systems, exploiting so-called “pedal” curves, or 
taking into account region-based information. However, these 
shortcomings associated with active contours have been elegantly and efficiently overcome 
by the Level Set approach introduced by Sethian and Osher. In this 
mathematical framework, the curve evolution is described through the evolution 
of an implicit higher-dimensional scalar function. The curve evolution is now 
described in a fixed coordinate system (Eulerian description), enabling the 
handling of topological changes. Such an approach or a related formalism has 
already been exploited in meteorological applications. The tracking of large 
cloud structures is achieved in [2], following ideas proposed elsewhere to recover 
the minimal paths over a 3D surface. This method, however, requires the cloud 
boundaries to be extracted beforehand in the successive images considered. In another 
approach, a particle system is exploited and embedded in an implicit surface formulation. In 
a third one, regions corresponding to convective clouds are extracted by first introducing 
posterior probabilities associated with the different cloud types. The curves then 
grow from user-defined “seed points” to the salient contour shapes. 

To ease the localization of the curve in the next image of the 
sequence, it seems relevant to exploit motion-based information. The integration 
of dynamic information has thus been proposed in earlier work, and again quite 
recently, by adding a motion-based term to the propagation function. Nevertheless, 
these last methods consider parametric motion models (i.e., 2D affine motion 
models), which are inappropriate in the case of highly deformable structures present in 
fluid motion such as clouds. In that context, dense motion fields are required to 
describe the non-linear nature of the cloud motions. Besides, in some of this work, motion 
information is in fact introduced to perform motion segmentation. 

During its life cycle (growth, stability and decline), a convective cloud cell 
is likely to undergo different changes of topology such as merging with other 
neighbouring cells or splitting. It therefore seems quite appropriate to follow the 
Level Set formulation to detect and track these clouds in meteorological 
satellite images. We propose an original two-stage Level Set method to handle this 
tracking issue. It introduces the use of dense motion information in a first step 
acting as a prediction stage. Then, the accurate location of the cloud is achieved 
in a second step by comparing the local intensity values to an appropriately 
estimated global temperature parameter representative of the tracked cloud cell. 
This step can generate appropriate expansion or contraction forces on the 
evolving contours to localize the boundaries of the cold clouds of interest. This is of 
primary importance since the predicted positions of the cloud cells usually 
overlap the real ones. Then, to characterize the convective activity of these clouds, 
we consider the joint evaluation of the local divergence information contained in 
the 2D estimated velocity field, and of the temporal variations of temperature 
of the cloud cell points. This makes it possible to qualify the convective activity level of the 
cloud cell, which corresponds to its degree of vertical evolution. The characterization 
stage is formulated as a contextual statistical labeling problem involving three 
classes: “growing activity”, “declining activity” and “inactivity”. 

The remainder of this paper is organized as follows. Section 2 outlines the main 
aspects of the Level Set formulation. Section 3 briefly describes how the cold 
clouds are initially detected. In Section 4, we describe our Level Set-based 
method to track these cold cloud cells. Section 5 deals with the characterization 
of the convective cells. The efficiency and accuracy of the proposed scheme are 
demonstrated in Section 6 with results obtained on numerous difficult 
meteorological situations. Section 7 contains concluding remarks. 

2 Level Set Formulation 

We briefly recall the main aspects of the level set formalism. Let $\eta_i(s, t_0)$ be 
a set of N closed initial curves in $\mathbb{R}^2$, with $i \in [1, N]$. An implicit representation 
of these curves is provided by the zero-level set of a scalar function $\psi$, defined by 
$z = \psi(X(s), t_0) = \pm d$, where d is the minimal signed distance from the image 
point, represented by vector $X(s) = [x(s), y(s)]^T$, to the curves $\eta_i(t_0)$ (the 
convention is plus sign for a location outside the set of curves $\eta_i(t_0)$). In our case, 
function $\psi$ corresponds to a 3D surface $\Gamma$. $\{\eta_i(s, t), i = 1, \ldots, N\}$ is the family of 
curves generated by the successive zero-level sets of the surface $\psi(X(s), t)$ moving 
along its normal directions $n = \nabla\psi / |\nabla\psi|$. For a given level set of $\psi$, $\psi(X(s), t) = C$, 
the speed function F at position X(s) represents the component of the vector 
$\partial X / \partial t$ normal to $\eta_i(s, t)$. Let $F = \frac{\partial X}{\partial t} \cdot n$. Differentiating both sides of the equation 
$\psi(X(s), t) = C$ with respect to t, and using the expressions of n and F, we obtain the Eulerian 
formulation of the evolution equation monitoring the successive positions of the 
surface $\Gamma$, evaluated over a fixed grid: 

$$\psi_t + F |\nabla\psi| = 0 \qquad (1)$$

where $\nabla\psi = (\partial\psi/\partial x, \partial\psi/\partial y)$. After each propagation step of the surface $\Gamma$ according to 
the speed function F, which is given by an iterative numerical resolution scheme, 
the relation $\psi(X(s), t) = 0$ yields the new position of the family of curves $\eta_i(s, t)$. 

The definition of a particular application based on the Level Set approach 
involves the design of the speed function F. 
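
To make the evolution equation concrete, the following Python sketch propagates a signed distance function with a constant positive speed F on a regular grid. The discretization (a first-order upwind scheme with unit grid spacing, valid for F ≥ 0) is an assumption of this sketch and a common choice for this type of equation; the text does not specify which numerical scheme is used. The function name `propagate` and all numerical values are hypothetical.

```python
import math


def propagate(psi, speed, dt, steps):
    """Advance psi_t + F |grad psi| = 0 with a first-order upwind scheme.

    Assumes a constant speed F >= 0 and unit grid spacing; border values are kept fixed.
    """
    rows, cols = len(psi), len(psi[0])
    for _ in range(steps):
        new = [row[:] for row in psi]
        for i in range(1, rows - 1):
            for j in range(1, cols - 1):
                dxm = psi[i][j] - psi[i][j - 1]  # backward difference in x
                dxp = psi[i][j + 1] - psi[i][j]  # forward difference in x
                dym = psi[i][j] - psi[i - 1][j]  # backward difference in y
                dyp = psi[i + 1][j] - psi[i][j]  # forward difference in y
                grad = math.sqrt(max(dxm, 0.0) ** 2 + min(dxp, 0.0) ** 2
                                 + max(dym, 0.0) ** 2 + min(dyp, 0.0) ** 2)
                new[i][j] = psi[i][j] - dt * speed * grad
        psi = new
    return psi


# Signed distance to a circle of radius 8 centered in a 41 x 41 grid (positive outside).
size, radius = 41, 8.0
center = size // 2
psi0 = [[math.hypot(j - center, i - center) - radius for j in range(size)]
        for i in range(size)]
psi1 = propagate(psi0, speed=1.0, dt=0.5, steps=4)  # total evolution time 2
```

With the plus-outside sign convention and F > 0, the zero-level set expands: after a total evolution time of 2 the contour, initially at radius 8, has moved to radius 10 along the grid axes. Merging of neighbouring contours requires no special handling in such a scheme, because only the grid function is updated.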

3 Early Detection and Initialization 

Before considering the tracking of cold cloud cells, let us briefly mention the early
detection stage, which provides us with the initial positions of the curves of interest.
This preliminary detection stage consists of a classical temperature thresholding
of the processed infrared satellite images to extract the colder clouds, which
may contain convective activity. In the very first image of the sequence, the
initial curves are all given by the contours of each connected set of pixels with
temperature values lower than a given threshold $I_{th}$. In the current image of the
sequence, this procedure is only valid for newly appearing cold clouds: if
the detected cloud areas are included in already tracked cells, they are removed.
For the already tracked cells, we take as initialization the contours obtained
in the previous image, denoted $\eta_{\tau-1}$ (note that $\tau$ designates the
“physical” time attached to the image sequence, whereas $t$ is used in the
evolution process corresponding to the Level Set formulation). Note also that,
for convenience, temperature and intensity are treated as equivalent in the sequel
(in practice, we use calibrated equivalence tables).
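
The detection stage described above can be sketched as follows. This is only an
illustrative example: the function names, the 4-connectivity and the toy
temperature values are assumptions, and the extraction of contours from the
labeled regions is left out.

```python
import numpy as np
from collections import deque

def label_cold_regions(temp, t_th):
    """Label 4-connected sets of pixels colder than t_th (0 = background)."""
    cold = temp < t_th
    labels = np.zeros(temp.shape, dtype=int)
    count = 0
    for seed in zip(*np.nonzero(cold)):
        if labels[seed]:
            continue                      # pixel already belongs to a region
        count += 1
        labels[seed] = count
        queue = deque([seed])
        while queue:                      # breadth-first flood fill
            r, c = queue.popleft()
            for rr, cc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
                if (0 <= rr < temp.shape[0] and 0 <= cc < temp.shape[1]
                        and cold[rr, cc] and not labels[rr, cc]):
                    labels[rr, cc] = count
                    queue.append((rr, cc))
    return labels, count

def keep_new_cells(labels, count, tracked_mask):
    """Keep only detected regions that are not included in already tracked cells."""
    return [k for k in range(1, count + 1)
            if not tracked_mask[labels == k].all()]

# Toy image (degrees Celsius) with two cold areas; the first one is already tracked.
temp = np.full((8, 8), -10.0)
temp[1:3, 1:3] = -50.0
temp[5:7, 4:7] = -40.0
labels, count = label_cold_regions(temp, t_th=-35.0)
tracked = np.zeros(temp.shape, dtype=bool)
tracked[0:4, 0:4] = True
new_cells = keep_new_cells(labels, count, tracked)
```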

4 Tracking of Cold Cloud Structures 

Solving the tracking issue amounts to specifying the speed function $F$
introduced in equation (1).

The top of a convective cloud is characterized by a low temperature due to
its high altitude as a result of vertical displacements, and by spatial intensity
gradients of rather small magnitude in the heart of the cloud cell but generally
larger in the vicinity of its boundaries. We exploit this a priori knowledge
both in the preliminary detection step providing the initial zero level sets,
as described above, and in the definition of the speed function $F$. Since we are
dealing with moving entities, it appears particularly relevant to exploit dynamic
information too. Indeed, predicting the new position of the curves at the
next instant brings more robustness (by preventing false pairings) and more
efficiency (by saving iterations and hence computational load). A motion
estimation step is therefore introduced. We only consider the regions delineated
by the closed contours as the support of the estimation of the 2D motion field to
be used. As stressed in the introduction, we need to compute a dense 2D velocity
field. To this end, we have adopted a robust incremental estimation method
leading to the minimization of a non-convex energy function.
This energy function involves robust M-estimators applied to the data-driven
term based on the optical flow constraint equation, and to a regularization term
preserving motion discontinuities. This method combines a hierarchical multigrid
minimization with a multiresolution analysis framework. This last point is
of key importance to provide accurate estimates in the case of cloud displacements
of large magnitude, which are likely to occur in this application. The estimation
of the 2D apparent motion field within selected areas must not be corrupted by
the surrounding motions of neighboring lower clouds. We have thus introduced
an adaptive subdivision scheme of the image, supplying an initial block partition
close to the selected areas. In order to obtain the final velocity field at full
resolution, the final size of the blocks in the minimization process is one pixel.

We have designed a function $F$ composed of two distinct components, $F_1$
and $F_2$, related respectively to dynamic and photometric information. These
components act sequentially in the evolution of the tracked curves.



4.1 Dynamic Component $F_1$

The first component $F_1$ of the speed function $F$ takes into account motion
information. It is defined by:

$F_1(p) = \hat{\xi}_1(p)\, F_{a1}\, \hat{\omega}(p) \cdot n(p) - \epsilon\kappa$    (2)

where $\hat{\xi}_1$ represents a stopping factor related to the 2D estimated motion field
$\omega$ between times $\tau-1$ and $\tau$. $F_{a1}$ is a positive constant greater than
one, which allows us to speed up convergence. The second term depends on the surface
curvature, given by $\kappa = \mathrm{div}\, n$. It can be seen as a smoothness term whose
influence on the evolving curve depends on the value of parameter $\epsilon$.

The $F_1$ component is applied in a first step, and then the photometric
component $F_2$ intervenes. $F_1$ makes contours evolve according to the projection
of $\omega$ on $n$. This component provides a prediction to the photometric tracking
step. Compared to a classical motion-based curve registration technique, this
formulation allows us to handle, in a well-formalized and efficient way, problematic
events such as splitting, merging, and crossing of cloud cells.

The component $F_1$ is of particular importance in the case of small cloud cells
whose apparent displacement magnitude is larger than their size, leading to no
overlap between two successive positions. Let us mention that their 2D apparent
motion is also due to the motion of the surrounding medium, which explains
why we can recover their motion using a multiresolution regularization method.

The 2D velocity vector $\omega(p)$ is available only on the zero-level set, i.e. on
the image plane. Therefore, we exploit the geometric Huygens' principle: the
value of $\hat{\omega}(p)$ at point $p$ is given by the one at pixel $\tilde{p}$ in the
image plane which is the nearest to $p$. We denote by $\hat{\omega}(p)$ the velocity
vector exploited at point $p$, given by the one computed at $\tilde{p}$.

Following the same principle, the stopping factor $\hat{\xi}_1$ denotes the global
“extension” of $\xi_1$ defined over the whole domain of $\psi$. We define it as follows:

$\hat{\xi}_1(p) = \delta\left[\Delta d_T(p) < |\hat{\omega}(p)|\right]$    (3)

where $\delta$ is the Kronecker symbol, $\Delta d_T(p) = \left|\sum_{t=1}^{T} d_t(p)\right|$,
and $d_t(p)$ is the shift vector at pixel $p$ induced by the implicit surface evolution
at the $t$-th iteration (for a total of $T$ iterations). The stopping factor is equal to
one when the value of $\Delta d_T(p)$ is lower than $|\hat{\omega}(p)|$ and zero otherwise.
The contour is stopped as soon as a sufficient number of pixels verify
$\hat{\xi}_1(p) = 0$. This stopping factor expresses the fact that the total shift
applied to the evolving contour at a given point $p$ must be bounded by the magnitude
of the corresponding estimated velocity vector.
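
The role of the stopping factor in the dynamic component can be illustrated with
the short sketch below. It assumes that the accumulated shift vectors, the
extended velocity vectors and the unit normals are stored as arrays whose last
axis holds the two vector components; the function names and the numerical
values are hypothetical.

```python
import numpy as np

def stopping_factor(total_shift, omega_hat):
    """1 where the accumulated shift magnitude is below |omega_hat|, else 0 (eq. 3)."""
    shift_mag = np.linalg.norm(total_shift, axis=-1)
    omega_mag = np.linalg.norm(omega_hat, axis=-1)
    return (shift_mag < omega_mag).astype(float)

def dynamic_speed(xi1, omega_hat, normal, kappa, f_a1, eps):
    """Stopping factor times amplified normal motion, minus the curvature term (eq. 2)."""
    return xi1 * f_a1 * np.sum(omega_hat * normal, axis=-1) - eps * kappa

# Two points with the same estimated velocity (magnitude 5).
omega_hat = np.array([[3.0, 4.0], [3.0, 4.0]])
normal = np.array([[0.6, 0.8], [0.6, 0.8]])      # unit normals aligned with the motion
total_shift = np.array([[1.0, 0.0], [6.0, 0.0]]) # the second point has already moved too far
xi1 = stopping_factor(total_shift, omega_hat)
f1 = dynamic_speed(xi1, omega_hat, normal, kappa=np.zeros(2), f_a1=2.0, eps=3.0)
```

The first point keeps moving along its normal, while the second one, whose
total shift exceeds the estimated displacement, is only subject to the
curvature term.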

Suppose that the hypersurface $\Gamma$ at iteration $t-1$ is the signed distance to
the contours $\eta_i(t-1)$. By Huygens' principle, updating function $\psi(t-1)$ to
give $\psi(t)$ turns out to be similar for all pixels of the grid belonging to the
normal to $\eta_i(t-1)$ at pixel $p$. The intersection of the hypersurface $\Gamma$
with the plane $(zOn)$ is indeed a straight line of unit slope. Finally, we can write
$d_t(p) = -(\psi_p(t) - \psi_p(t-1))\, n$.

The contour shift at point $p$ between iterations $t-1$ and $t$ is effective only if
$d_t(p)$ is large enough, since the extraction process providing the current position
of the contours after each iteration only yields integer coordinates. An example of
results is shown in Fig. 1.

4.2 Photometric Component $F_2$

We aim at determining a strategy able to accurately move contours toward the
real cold cloud boundaries. To this end, we exploit thermal information (i.e.,
intensity information in thermal infrared images) along the contours. These local
temperatures are compared to a global temperature characteristic of the tracked
cloud cell at time $\tau$. The sign of the difference between these local and global
temperatures determines the way the contour evolves, i.e. the direction of the applied
force $F_2$ at point $p$. This allows us to explicitly introduce locally contracting
or expanding evolution of the contour according to the local configuration at
hand, which is of particular importance since the current and desired positions
of the curve are supposed to overlap. A somewhat similar flexible mechanism,
but arising from different considerations, has also been proposed quite recently
to extract shapes from background in static images.

Fig. 1. Contour evolution successively monitored by the two components of the speed
function $F$ and their associated stopping factors. Part of infrared Meteosat image
acquired on August 10, 1995 at 12h30 TU. (a) Initial contours (overprinted in black).
(b) Contours after the first tracking step involving the $F_1$ component. (c) Final
contours after successively performing the two tracking steps involving respectively
the $F_1$ component and the $F_2$ component.

The global characteristic temperature of the cloud cell at time $\tau$ is estimated
as follows. It is predicted from dynamic and thermal information obtained from
the previous time instant. We denote by $\theta_i^\tau$ the characteristic temperature
associated with the contour $\eta_i^\tau$ at time $\tau$. $\theta_i^\tau$ is obtained by
assuming that the intensity function verifies the following continuity equation of
fluid mechanics:

$\frac{\partial I}{\partial \tau} + \mathrm{div}(I\omega) = 0$    (4)



This equation is related to the assumption that expansion or contraction of fluid
(associated with a dissipation or a concentration of matter) corresponds to
intensity changes in the image sequence. Recalling that
$\mathrm{div}(I\omega) = \omega \cdot \nabla I + I\,\mathrm{div}\,\omega$,
and using the expression of the total derivative of $I$ with respect to time,
$\frac{dI}{d\tau} = \frac{\partial I}{\partial \tau} + \omega \cdot \nabla I$,
we can rewrite equation (4) as follows:

$\frac{dI}{d\tau} + I\,\mathrm{div}\,\omega = 0$    (5)



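Assuming a constant speed over the “particle” trajectories from $\tau-1$ to $\tau$
(with $\mathrm{div}\,\omega$ taken as constant along each trajectory over this unit
time interval), we can express intensity $I$ at time $\tau$ at the displaced point
$p + \omega(p)$ by integrating both members of equation (5), which leads to:

$I(p + \omega(p), \tau) = I(p, \tau-1)\, \exp(-\mathrm{div}\,\omega(p))$    (6)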



The characteristic temperature $\theta_i^\tau$ of the cell corresponding to contour
$\eta_i$ can be given by the mean of $I(p + \omega(p), \tau)$ evaluated over the region
$R_i$ delineated by the contour $\eta_i^{\tau-1}$:

$\theta_i^\tau = \frac{1}{N_i} \sum_{p \in R_i} I(p, \tau-1)\, \exp(-\mathrm{div}\,\omega(p))$    (7)

where $N_i$ is the number of pixels in $R_i$.

We need to compute the divergence of the 2D estimated motion field. It
is expressed by
$\mathrm{div}\,\omega(p) = \frac{\partial u(p)}{\partial x} + \frac{\partial v(p)}{\partial y}$,
where $\omega(p) = [u(p), v(p)]^T$ is the velocity vector at pixel $p$. It is derived
from the estimated motion field by using appropriate derivative filters.
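
As a rough illustration of equations (6) and (7), the sketch below predicts the
characteristic temperature of a cell from the previous image and the estimated
motion field. Central differences (`np.gradient`) stand in for the derivative
filters, which are not specified here; the function names, the synthetic
velocity field and the intensity values are assumptions.

```python
import numpy as np

def divergence(u, v):
    """div omega = du/dx + dv/dy, with x along columns and y along rows."""
    return np.gradient(u, axis=1) + np.gradient(v, axis=0)

def characteristic_temperature(prev_img, u, v, region):
    """Mean over the region of I(p, tau-1) * exp(-div omega(p)), as in eq. (7)."""
    predicted = prev_img * np.exp(-divergence(u, v))
    return predicted[region].mean()

# Synthetic example: uniform intensity and a uniformly diverging motion field.
rows, cols = np.indices((16, 16))
u = 0.1 * cols                       # du/dx = 0.1 everywhere
v = np.zeros((16, 16))
prev_img = np.full((16, 16), 220.0)
region = np.zeros((16, 16), dtype=bool)
region[4:12, 4:12] = True
theta = characteristic_temperature(prev_img, u, v, region)
```

In this example the positive divergence lowers the predicted intensity, which
matches the expected behaviour of an expanding cell.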

Solving equation (1) moves the set of initial curves toward the new
positions of the cloud cells. We have designed the following expression of the speed
function $F_2$, composed of a curvature term and a so-called advection term:

$F_2(p) = \hat{\xi}_2(p)\, \underbrace{F_{a2}\, \mathrm{sign}[\theta_i - I(p)]}_{\text{advection term } F_{adv}(p)} - \epsilon\kappa$    (8)

where $F_{a2}$, $\theta_i$ and $I$ respectively denote the constant magnitude of the
advection force, the estimated characteristic temperature of the convective cloud
cell $i$, and the intensity function in the infrared satellite image (intensity $I(p)$
here accounts for temperature).

As already mentioned, the definition of the advection term of $F_2$ allows us to
deal with a force either of contraction or of expansion depending on the intensity
value $I(p)$. This is further explained below.

We need to exploit relevant image-based information to stop the evolution
of the curves at the real boundaries of the cloud cells. To this end, we have to
define a weighting factor $\xi_2$ in the speed function $F_2$, i.e. an image-based
factor which will play the role of stopping criterion. The global “extension” of
$\xi_2$, denoted $\hat{\xi}_2$, can be written as:



where $\delta(x) = 0$ if $x = 0$, $\delta(x) = 1$ otherwise, function $g_I(p)$ is given by
$g_I(p) = \frac{1}{(1 + |\nabla G_\sigma * I(p)|)^2}$, and $F^{t}_{adv}(p)$ and
$F^{t-1}_{adv}(p)$ respectively denote the advection term computed at iterations $t$
and $t-1$. $G_\sigma * I(p)$ represents the convolution of the image with a Gaussian
smoothing filter. When the evolving contours are located within warm areas
(i.e. $I(p) > \theta_i$), they undergo a contraction force which moves them toward a
cold cloud cell boundary. After a boundary of a cloud cell is crossed and the curve
point is within the cloud cell, $I(p)$ becomes lower than $\theta_i$. This induces a
change of the sign of the advection term, and thus defines an expansion force. Since
the $\delta$ term then becomes equal to 0, the value of $\hat{\xi}_2(p)$ is given by
$g_I(p)$, which can tend to zero if high intensity contrast is present at point $p$.
Hence, we have introduced intensity spatial gradient information in an appropriate
way, i.e. only when the evolving curve lies inside a cloud cell of interest. The curve
$\eta_i$ then moves in the opposite direction, and stops at the first contrasted
intensity edges it encounters. Owing to the proposed scheme, evolving curves thus
cannot be attracted by intensity edges belonging to non-relevant clouds or to other
visible structures in the image. An example of results obtained after successively
performing the two tracking steps, respectively involving components $F_1$ and $F_2$,
is shown in Fig. 1.
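
The switch between contraction and expansion produced by the advection term of
equation (8) can be checked on a toy example. The sketch below assumes that the
stopping factor and the curvature are supplied by the caller; the function
names and the temperature values are hypothetical.

```python
import numpy as np

def advection_term(img, theta, f_a2):
    """F_a2 * sign(theta - I(p)): negative on warm pixels, positive on cold ones."""
    return f_a2 * np.sign(theta - img)

def photometric_speed(img, theta, xi2, kappa, f_a2, eps):
    """Weighted advection term minus the curvature term, as in eq. (8)."""
    return xi2 * advection_term(img, theta, f_a2) - eps * kappa

# One warm pixel (outside the cell) and one cold pixel (inside the cell), in deg C.
img = np.array([-20.0, -50.0])
theta = -40.0
speed = photometric_speed(img, theta, xi2=1.0, kappa=0.0, f_a2=10.0, eps=3.0)
```

With the plus sign outside the curves, the negative speed on the warm pixel
contracts the contour and the positive speed on the cold pixel expands it.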

The use of both an appropriate initialization and a motion-based prediction 
embedded in the first tracking step allows us to provide a real tracking of convec- 
tive cloud cells over time. We mean that we can effectively and reliably associate 
the extracted contours from one image to the next one, even in situations with 
no significant overlap between two corresponding contours. 

To save computational time, we make use of the “narrow band” framework.
Moreover, we process each narrow band independently. Then, if one of them
contains a contour which has reached the desired cloud cell boundaries, the
restriction of the function $\psi$ to the corresponding narrow band is frozen,
and the computational cost is thus further reduced.
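
A minimal sketch of the narrow band idea is given below. It assumes that the
band is defined by thresholding the signed distance at half the band width,
which is one possible reading of the band width; the function names and the
circular test curve are hypothetical.

```python
import numpy as np

def narrow_band(psi, width=8.0):
    """Boolean mask of the grid points lying within width/2 of the zero-level set."""
    return np.abs(psi) <= width / 2.0

def update_in_band(psi, psi_candidate, band, frozen=False):
    """Accept evolved values only inside the band; a frozen band is left untouched."""
    if frozen:
        return psi
    return np.where(band, psi_candidate, psi)

# Signed distance to a circle of radius 10 on a 64 x 64 grid.
rows, cols = np.mgrid[0:64, 0:64]
psi = np.hypot(cols - 32, rows - 32) - 10.0
band = narrow_band(psi, width=8.0)
updated = update_in_band(psi, psi - 0.5, band)
fraction_processed = band.sum() / psi.size   # share of pixels actually updated
```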

5 Characterization of Convective Activity and Extraction 
Refinement 

The clouds located within the closed contours $\eta_\tau$ issued from the tracking
stage may include truly active convective cells (CC), CC in a declining phase,
or cold clouds which are not convective clouds. We have to identify regions
undergoing strong vertical motion, corresponding either to growing or to declining
convective clouds. The vertical development of a growing CC is accompanied by
a spatial expansion at its top along with a cooling. The opposite
occurs for a declining CC. Therefore, in order to qualify and extract these
convective activity areas, it seems particularly relevant to jointly evaluate the
degree of divergence of the 2D apparent motion and the trend of the temporal
changes in temperature.

5.1 Discriminant Features of Convective Activity 

The first discriminant feature of convective activity is related to the dynamic
properties of the cold clouds of interest. It is supplied by the local divergence of
the estimated 2D motion field, computed at each point of the tracked cloud cell
as explained in Section 4.2.

The temporal evolution of the cloud temperature provides the second discriminant
feature. We evaluate the temporal change of temperature at each point of
the tracked cloud by considering the displaced frame difference supplied by the
estimated 2D velocity field:
$I_\tau(p, \omega(p)) = I(p + \omega(p), \tau) - I(p, \tau-1)$. To take
into account motion compensation errors and image noise, we in fact consider a
locally averaged version:

$\bar{I}_\tau(p, \omega(p)) = \frac{1}{M} \sum_{r \in W(p)} \left( I(r + \omega(r), \tau) - I(r, \tau-1) \right)$    (10)

where $W(p)$ is a local window centered on pixel $p$ and containing $M$ pixels. An
example of joint evaluation of local motion divergence and temporal temperature
variation can be found in Fig. 2. We can note the characteristic temporal
evolution of a convective cloud cell (indicated by an arrow in images (a) and (d)).
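
The locally averaged displaced frame difference of equation (10) can be sketched
as follows. Nearest-neighbour motion compensation, border clipping and a square
averaging window are simplifying assumptions, as are the function names and the
synthetic images.

```python
import numpy as np

def displaced_frame_difference(img_prev, img_cur, u, v):
    """I(p + omega(p), tau) - I(p, tau-1), with nearest-neighbour compensation."""
    rows, cols = np.indices(img_prev.shape)
    r2 = np.clip(np.rint(rows + v).astype(int), 0, img_prev.shape[0] - 1)
    c2 = np.clip(np.rint(cols + u).astype(int), 0, img_prev.shape[1] - 1)
    return img_cur[r2, c2] - img_prev

def local_mean(values, half=1):
    """Average over a (2*half+1) x (2*half+1) window, with replicated borders."""
    size = 2 * half + 1
    padded = np.pad(values, half, mode="edge")
    out = np.zeros(values.shape, dtype=float)
    for dr in range(size):
        for dc in range(size):
            out += padded[dr:dr + values.shape[0], dc:dc + values.shape[1]]
    return out / size ** 2

# Synthetic cell moving 2 pixels to the right while cooling by 5 units.
prev = np.full((12, 12), 250.0)
prev[4:8, 3:6] = 210.0
cur = np.roll(prev, 2, axis=1) - 5.0
u = np.full(prev.shape, 2.0)
v = np.zeros(prev.shape)
dfd = displaced_frame_difference(prev, cur, u, v)
dfd_avg = local_mean(dfd, half=1)
```

Once the motion is compensated, only the temperature change remains in the
difference image.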




Fig. 2. In columns (a) and (d), part of infrared Meteosat images acquired on August 4,
1995 at 11h00 TU and 13h30 TU. Local motion divergence maps (columns (b) and (e)),
computed on convective cloud cells selected after the tracking stage, and the temporal
variations of temperature (columns (c) and (f)). Display in (b), (c), (e) and (f) varies
from light grey (highly negative values) to black (highly positive values).



The growing phase (Fig. 2a) presents strong positive values of motion divergence
(dark grey in Fig. 2b) and a decrease in temperature (light grey in Fig. 2c).
The subsequent declining phase (Fig. 2d) is identified by strong negative
divergence values (light grey in Fig. 2e) and a warming up of the cloud top (dark
grey in Fig. 2f).



5.2 Extraction of Active Convective Clouds 

We now have to exploit these two discriminant features, i.e. $\mathrm{div}\,\omega(p)$
and $\bar{I}_\tau(p, \omega(p))$, to determine the active convective clouds among the
tracked cold cloud areas.

The CC characterization scheme is formulated as a labeling problem of these
areas. We have adopted a contextual statistical approach based on Bayesian
estimation (MAP criterion) associated with Markov Random Field (MRF) models.
The MRF framework provides a powerful formalism to specify physical relations
between observations $o$ (i.e. temporal changes of temperature, local motion
divergences) and the label field $e$, while easily allowing us to express a priori
information on the expected properties of the label field (i.e. spatial
regularization). We consider three classes: two classes of activity, growing activity
(“grow”) and declining activity (“decl”), and one of inactivity (“nact”). The last one
can contain non-active clouds but also elements which remain undetermined due
to non-significant feature values.

Due to the equivalence between MRF and Gibbs distributions, it turns out
that this leads to the definition of a global energy function $U(o,e)$. We have
designed the following energy function composed of a data-driven term and a
regularization term:

$U(o,e) = \sum_{s \in S} V_1(o,e) + \alpha_2 \sum_{c \in C} V_2(e)$    (11)

where $V_1$ and $V_2$ are local potentials, $s$ is a site (here, a pixel) of the set of
sites $S$, and $C$ represents all the binary cliques $c$ (i.e. cliques formed by two
sites) associated with the considered second-order neighborhood system on the set of
sites (pixels). $\alpha_2$ controls the relative influence of the data-driven term and
of the regularization term.

As a matter of fact, we consider the two features introduced above in a
combined way through the following product:

$\mu(s) = \mathrm{div}\,\omega(s) \times \bar{I}_\tau(s, \omega(s))$    (12)

The adequacy between a given label and the computed quantity $\mu(s)$ is governed
by the sign and the magnitude of $\mu(s)$. If the two discriminant features present
opposite signs at site $s$ ($\mu(s) < 0$), this reveals convective activity, either
growing activity ($\mathrm{div}\,\omega(s) > 0$ and $\bar{I}_\tau < 0$) or declining
activity ($\mathrm{div}\,\omega(s) < 0$ and $\bar{I}_\tau > 0$). To further distinguish
labels “grow” and “decl”, we examine the sign of $\mathrm{div}\,\omega(s)$.
If $\mathrm{div}\,\omega(s) < 0$, potential $V_1$ will favour the label “decl”, otherwise
the label “grow” will be preferred. $\mu(s) > 0$ is not related to a specific physical
meaning and label “nact” will be favoured. The potential $V_1$ is defined at site $s$ by:

$V_1(o,e) = \begin{cases} \mathrm{sign}[\mathrm{div}\,\omega(s)]\, f(\mu) + \frac{1}{2} & \text{if } e(s) = \text{grow} \\ -\mathrm{sign}[\mathrm{div}\,\omega(s)]\, f(\mu) + \frac{1}{2} & \text{if } e(s) = \text{decl} \\ -f(\mu) + \frac{1}{2} & \text{if } e(s) = \text{nact} \end{cases}$    (13)

where $f(x)$ is a smooth stepwise function. We have chosen $f(x) = \arctan(k\pi x)$.
Potential $V_2$ in the regularization term is defined by $V_2[e(r), e(s)] = -\beta$ if
$e(r) = e(s)$ and $V_2[e(r), e(s)] = \beta$ otherwise, where $r$ and $s$ are two
neighbouring sites. $V_2$ favours compact areas of the same label. This formulation
leads to the minimization of the global energy function $U(o,e)$, which is solved
iteratively using the deterministic relaxation algorithm ICM.
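
To make the relaxation step concrete, the sketch below runs ICM on an energy of
the form of equation (11), with the regularization potential $V_2$ defined
above and an 8-neighbour (second-order) neighbourhood. The data potentials are
passed in as a precomputed array rather than derived from equation (13); the
function name, the raster visiting order, the fixed number of sweeps and the
toy potentials are assumptions.

```python
import numpy as np

def icm(v1, beta, alpha2, n_sweeps=5):
    """Iterated conditional modes for sum V1 + alpha2 * sum V2.

    v1 has shape (n_labels, height, width) and holds the data potential of
    each label at each site; V2 is -beta for equal neighbouring labels and
    +beta otherwise.
    """
    n_labels, height, width = v1.shape
    labels = v1.argmin(axis=0)            # start from the data term alone
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, -1),
               (0, 1), (1, -1), (1, 0), (1, 1)]
    for _ in range(n_sweeps):
        for r in range(height):
            for c in range(width):
                best_label, best_energy = labels[r, c], np.inf
                for lab in range(n_labels):
                    energy = v1[lab, r, c]
                    for dr, dc in offsets:
                        rr, cc = r + dr, c + dc
                        if 0 <= rr < height and 0 <= cc < width:
                            same = labels[rr, cc] == lab
                            energy += alpha2 * (-beta if same else beta)
                    if energy < best_energy:
                        best_label, best_energy = lab, energy
                labels[r, c] = best_label  # greedy local update
    return labels

# Toy potentials: label 0 is preferred everywhere except at one isolated site.
v1 = np.empty((2, 5, 5))
v1[0], v1[1] = 0.4, 0.6
v1[:, 2, 2] = (0.6, 0.4)
smoothed = icm(v1, beta=0.1, alpha2=3.0)
unregularized = icm(v1, beta=0.1, alpha2=0.0)
```

In this toy case the isolated site is relabeled by its neighbours when the
regularization term is active, and keeps its own label when it is switched off.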

The tracking process is concerned with all the cloud areas issued from the 
detection stage and not only with those from the characterization stage for the 
following reason. A large convective system can contain different zones of distinct 
activity which may evolve quickly over time. This temporal evolution does not 
allow us to perform a relevant and significant tracking of cloud cells displaying 
a real convective activity. Tracking cold clouds and characterizing in a second 
stage their convective activity appears to be more stable and physically more 
meaningful. 



6 Results 

We have carried out numerous experiments on real complex examples involving
Meteosat infrared or water vapor images. Here, we report representative results
obtained after each stage of the proposed scheme. Figures 3 and 4 illustrate the
tracking stage, Figure 5 the characterization stage.

For display convenience, pixels corresponding to low temperature are
represented by white intensities, and conversely. Figure 3 contains three different
meteorological situations. For each, we supply the initial contours corresponding
to the cold clouds detected in the previous image (central row) and the final
locations of the cold clouds (lower row). We can observe that the tracking is quite
accurate even in the case of large displacements (first example, in the left column
of Fig. 3) or in the case of the formation of holes within a cloud cell (third example,
in the right column of Fig. 3). Let us point out that forecasters are particularly
interested in the accurate and reliable determination of colder areas of convective
clouds, which are generally quite uniform. Thus, the stopping criterion we have
designed stops the inner part of the evolving curve at the first well-contrasted
intensity edges it encounters. The consequence is that resulting contours may be
located inside cloud cells.

Examples of tracking of cloud cells over time are shown in Fig. 4, depicting
successive meteorological satellite images over Italy and Sardinia. We can
point out the accuracy and the temporal coherence of the obtained results, which
is of key importance for forecasters. These warm European areas (dark grey level)
are the source of convective activity. At 10h30 TU, small cloud cells over Italy
reach higher altitude and are correctly detected in time. They grow and merge,
and progressively other surrounding cells undergo the same process. At 14h00
TU (last column, last row), a convective cloud cell becomes too warm with
respect to the characteristic temperature estimated from the previous image,
and consistently disappears.

Fig. 3. Evolution of the cloud contours in the tracking stage. Upper row: original
infrared images; central row: initial positions corresponding to those determined in the
previous image; lower row: final positions of cloud contours. From left to right: part of
Meteosat infrared images on August 24, 1995, at 18h30 TU, August 28, 1995, at 4h30
TU, and August 10, 1995 at 21h00 TU. Contours are overprinted in black.

The characterization of the convective activity of the tracked clouds shown in
Fig. 4 is reported in Fig. 5. Six successive results obtained after the
characterization stage are supplied. Dark grey corresponds to active clouds in a
growing phase. Conversely, CC in a declining state are labeled in light grey. At the
beginning of the sequence, the central cloud cell undergoes a strong vertical motion
and the whole corresponding area is correctly labeled as “growing activity”.
Progressively, this cloud cell becomes less and less active, and the dark grey area
shrinks toward its core. At the same time, a declining zone develops until it
contains almost the entire area.

Fig. 4. Results of the tracking stage over a sequence of Meteosat infrared images (two
first rows). Illustration of the temporal coherence of the tracking stage of cold clouds.
Final contours are overprinted on the original images (two last rows). Part of Meteosat
images acquired on August 4, 1995 from 10h30 to 14h00 TU.

The same parameter values are used for all these experiments. Concerning the
tracking stage, we set $F_{a1} = 10$, $F_{a2} = 10$, $\epsilon = 3$, and the width of the
narrow band is 8 pixels. The temperature threshold $I_{th}$ is −35°C. In the
characterization stage, we set $k = 5$, $\beta = 0.1$ and $\alpha_2 = 3$. The choice of
parameter values associated with the tracking stage only affects the speed of
convergence and not the accuracy of results. Parameter values related to the
characterization stage only influence the labeling of the uncertain activity areas
(i.e. containing weak discriminant feature magnitudes), and it was found that their
setting was not critical.

The tracking stage has been evaluated by forecasters on several real representative
situations (including those reported in this paper) and appeared quite accurate.
An extended experimental validation of the characterization stage is just about
to be completed by a French meteorological center on the basis of a daily analysis
by a forecaster in an operational context. First results are already convincing.
The computational time is in accordance with operational requirements, since
Meteosat satellite images are acquired every thirty minutes. CPU time behaves
as a linear function of the number of processed pixels (those involved in the narrow
band technique). It takes about six minutes for a quantity of processed pixels in
the narrow bands equal to 128 x 128 on a Sun Ultra 60 workstation.



7 Conclusion 



We have proposed in this paper an original and efficient framework to detect,
track and characterize convective cold clouds from meteorological satellite images.
It involves two main stages, the tracking stage relying on the Level Set
formalism, and the characterization stage stated as a statistical contextual
labeling issue. This approach is quite relevant to properly process such highly
deformable structures, which are often subject to splitting or merging phases
during their life cycle. We have designed a two-step tracking scheme exploiting
both motion and photometric information in an adequate way. The first step
exploits a 2D estimated motion field, and supplies a proper prediction to the
second one. The former moves contours along the direction of estimated motion
while immediately taking into account topological changes, unlike a usual
registration step. The second step uses photometric information at a local level
and at the cell level, and can create appropriate expansion or contraction forces
on the evolving contours to accurately localize in every image the cold clouds of
interest.

Fig. 5. Characterization stage applied to Meteosat infrared images. Labeling results are
overprinted on the original images: in dark, resp. light, grey: growing, resp. declining,
convective clouds. Part of Meteosat images acquired on August 4, 1995 from 10h30 TU
to 14h00 TU.

The characterization stage relies on local measurements involving divergence
computed from the estimated 2D motion field and local temporal variations of
the tracked clouds. It leads to the minimization of an energy function comprising
a spatial regularization term. It allows us to extract, within the clouds delimited
in the tracking stage, the regions of significant vertical motion, i.e. the truly
active convective cloud cells, and to distinguish those in a growing phase from those
in a declining phase. The computational time, which is usually a drawback of
the Level Set approach, is significantly reduced thanks to the two-step tracking
scheme introduced in our method. Besides, the first tracking step, appropriately
exploiting motion information, leads to positions of the curve overlapping the
real boundaries of the cold clouds of interest. The second tracking step can then
start from this prediction, since the designed associated speed function allows a
curve to evolve in two ways, contraction and expansion. Another advantage of
this method is that results do not strongly depend on the choice of the parameter
values. Results obtained on numerous difficult real examples demonstrate the
temporal coherence and the accuracy of the extracted convective clouds tracked
over time, which provides forecasters with an easy understanding of the
meteorological situation. Finally, the tracking method introduced in this paper is
not specific to the considered application, and could be successfully applied to
other kinds of deformable structures.



A Unifying Theory for Central Panoramic
Systems and Practical Implications

Abstract. Omnidirectional vision systems can provide panoramic alertness
in surveillance, improve navigational capabilities, and produce panoramic
images for multimedia. Catadioptric realizations of omnidirectional
vision combine reflective surfaces and lenses. A particular class of them,
the central panoramic systems, preserve the uniqueness of the projection
viewpoint. In fact, every central projection system, including the well
known perspective projection on a plane, falls into this category.

In this paper, we provide a unifying theory for all central catadioptric 
systems. We show that all of them are isomorphic to projective mappings 
from the sphere to a plane with a projection center on the perpendicular 
to the plane. Subcases are the stereographic projection equivalent to 
parabolic projection and the central planar projection equivalent to every 
conventional camera. We define a duality among projections of points and 
lines as well as among different mappings. 

This unification is novel and has a significant impact on the 3D
interpretation of images. We present new invariances inherent in parabolic
projections and a unifying calibration scheme from one view. We describe
the implied advantages of catadioptric systems and explain why images
arising in central catadioptric systems contain more information than
images from conventional cameras. One example is that intrinsic
calibration from a single view is possible for parabolic catadioptric systems
given only three lines. Another example is metric rectification using only
affine information about the scene.



1 Introduction 

Artificial visual systems face extreme difficulties in tasks like navigating on un- 
even terrain or detecting other movements when they are moving themselves. 
Paradoxically, these are tasks which biological systems like insects with very 
simple brains can very easily accomplish. It seems that this is not a matter 
of computational power but a question of sensor design and representation. The 
representation of visual information has to be supported by the adequate sensors 
in order to be direct and efficient. It is therefore surprising that most artificial 
visual systems use only one kind of sensor: a CCD-camera with a lens. 

We believe that the time has come to study the question of representation
in parallel to the design of supportive sensing hardware. As in nature, these
sensors and representations should depend on the tasks and the physiology of the
observer. Omnidirectional or panoramic visual sensors are camera designs that
enable capturing of a scene with an almost hemispherical field of view.
Originally introduced mainly for monitoring activities, they are now widely used in
multimedia and robotics applications. The advantages of omnidirectional sensing
are obvious for applications like surveillance, immersive telepresence,
videoconferencing, mosaicing, and map building. A panoramic field of view eliminates
the need for more cameras or a mechanically turnable camera. We prove in this
paper that a class of omnidirectional sensors, the central panoramic systems,
can recover information about the environment that conventional models of
perspective projection on a plane cannot.

First let us summarize recent activities on omnidirectional vision. A panoramic
field of view camera was first proposed by Rees. After 20 years, the concept
of omnidirectional sensing was reintroduced in robotics for the purpose of
autonomous vehicle navigation. In the last five years, several omnidirectional
cameras have been designed for a variety of purposes. The rapid growth of
multimedia applications has been a fruitful testbed for panoramic sensors
applied for visualization. Another application is telepresence, where the
panoramic sensor achieves the same performance as a remotely controlled rotating
camera with the additional advantage of an omnidirectional alert awareness.
Srinivasan [2] designed omnidirectional mirrors that preserve ratios of elevations
of objects in the scene, and Hicks constructed a mirror system that rectifies
planes perpendicular to the optical axis. The application of mirror-lens systems
in stereo and structure from motion has been described in prototype form.
Our work is hardly related to any of the above approaches. The fact that lines
project to conics is mentioned in the context of epipolar lines by Svoboda
and Nayar.

Omnidirectional sensing can be realized with dioptric or catadioptric sys- 
tems. Dioptric systems consist of fish-eye lenses while catadioptric systems are 
combinations of mirrors and lenses. These sensors can be separated into two
classes, determined by whether they have a unique effective viewpoint. Conical
and spherical mirror systems as well as most fish-eye lenses do not possess a
single vantage point. Among those that do have a unique effective viewpoint are
systems composed of multiple planar mirrors and perspective cameras
all of whose viewpoints coincide, as well as a hyperbolic mirror in front of a
perspective camera, and a parabolic mirror in front of an orthographic camera. The
uniqueness of a projection point is equivalent to a purely rotating planar camera
with the nice property that a rotated image is a collineation of an original one.
Hence, every part of an image arising from such a catadioptric sensor can easily be
re-warped into the equivalent image of a planar camera looking in the desired
direction without knowledge of the depths of the scene. It is worth mentioning
that simple dioptric systems — conventional cameras — are included in this class 
of catadioptric systems because they are equivalent to catadioptric systems with 
a planar mirror. 

In this paper, we present a unifying theory for all central panoramic systems,
that is, for all catadioptric systems with a unique effective viewpoint. We
prove that all cases of a mirror surface (parabolic, hyperbolic, elliptic, planar)
with the appropriate lens (orthographic or perspective) can be modeled with
a projection from the sphere to the plane where the projection center is on a
sphere diameter and the plane is perpendicular to it. Singular cases of this model
are stereographic projection, which we show to be equivalent to the projection
induced by a parabolic mirror through an orthographic lens, and central projection,
which is well known to be equivalent to perspective projection. 

Given this unifying projection model we establish two kinds of duality: a 
duality among point projections and line projections and a duality among two 
sphere projections from two different centers. We show that parallel lines in space 
are projected onto conics whose locus of foci is also a conic. This conic is the 
horizon of the plane perpendicular to all of the original lines, but the horizon 
is obtained via the dual projection. In the case of perspective projection all conics
degenerate to lines and we have the well known projective duality between lines
and points in $\mathbb{P}^2$.

The practical implications are extremely useful. The constraints given by 
the projection of lines are natural for calibration by lines. We prove that three 
lines are sufficient for intrinsic calibration of the catadioptric system without any 
metric information about the environment. We give a natural proof why such a 
calibration is not possible for conventional cameras showing thus the superiority 
of central catadioptric systems. The unifying model we have provided allows us 
to study invariances of the projection. Perhaps most importantly, in the case 
of parabolic systems we prove that angles are preserved because the equivalent 
projection — stereographic — is a conformal mapping. This allows us to estimate 
the relative position of the plane and facilitates a metric rectification of a plane 
without any assumption about the environment. 

In section 2 we prove the equivalence of the catadioptric and spherical pro- 
jections and develop the duality relationships. In section 3 we present the com- 
putational advantages derived from this theory and in section 4 we show our 
experimental results. 



2 Theory of Catadioptric Image Geometry 

The main purpose of this section is to prove the equivalence of the image geome- 
tries obtained by the catadioptric projection and the composition of projections 
of a sphere. We first develop the general spherical projection, and then the
catadioptric projections, showing in turn that each is equivalent to some spherical
projection. Then we will show that two projections of the sphere are dual to 
each other, and that parabolic projection is dual to perspective projection. 



2.1 Projection of the Sphere to a Plane 

We introduce here a map from projective space to the sphere and then to an image plane.
A point in projective space is first projected to an antipodal point pair on the
sphere. An axis of the sphere is chosen, as well as a point on this axis, but
within the sphere. From this point the antipodal point pair is projected to a pair
of points on a plane perpendicular to the chosen axis.

Assume that the sphere is the unit sphere centered at the origin, the axis is
the z-axis and the point of projection is the point $(0, 0, l)$. Let the plane $z = -m$
be the image plane. If $\sim$ is the equivalence relation relating antipodal points on
the sphere, then the map from projective space to the sphere $s : \mathbb{P}^3(\mathbb{R}) \to S^2/\!\sim$
is given by

$$ s(x,y,z,w) = \left( \pm\frac{x}{r},\; \pm\frac{y}{r},\; \pm\frac{z}{r} \right), $$

where $r = \sqrt{x^2 + y^2 + z^2}$. To determine the second part of the map, we need
only determine the perspective projection to the plane $z = -m$ from the point
$(0, 0, l)$. Without taking the equivalence relation into account the projection of
$(x, y, z)$ is

$$ p_{l,m}(x,y,z) = \left( \frac{x(l+m)}{lr - z},\; \frac{y(l+m)}{lr - z} \right). $$

Now applying the equivalence relation we have a map $p^*_{l,m} : \mathbb{P}^3(\mathbb{R}) \to \mathbb{R}^2/\!\sim$,

$$ p^*_{l,m}(x,y,z,w) = \left( \pm\frac{x(l+m)}{lr \mp z},\; \pm\frac{y(l+m)}{lr \mp z} \right). $$

Here $\mathbb{R}^2/\!\sim$ is $\mathbb{R}^2$ with the equivalence relation induced on it by the map $p_{l,m}$ and
$\sim$ on the sphere.

If we move the projection plane to $z = -a$, then the relation between the
two projections is

$$ p^*_{l,m}(x,y,z,w) = \frac{l+m}{l+a}\; p^*_{l,a}(x,y,z,w). $$

So they are the same except for a scale factor. Thus if m is not indicated it is
assumed that m = 1.



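Remark. When l = 1 and m = 0, i.e. the point of projection is the north pole,
we obtain

$$ p^*_{1,0}(x,y,z,w) = \left( \pm\frac{x}{\sqrt{x^2+y^2+z^2} \mp z},\; \pm\frac{y}{\sqrt{x^2+y^2+z^2} \mp z} \right), $$

which is a case of stereographic projection (when (x, y, z) is restricted to
the sphere). On the other hand, when l = 0 and m = 1, we have perspective
projection:

$$ p^*_{0,1}(x,y,z,w) = \left( -\frac{x}{z},\; -\frac{y}{z} \right), $$

where the sign reflects that the image plane $z = -1$ lies on the opposite side
of the projection center from the point.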
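The two-step map of this section (normalise onto the unit sphere, then project from (0, 0, l) onto the plane z = -m) can be sketched numerically. This is only an illustration under the conventions stated above, not the authors' implementation; the names `sphere_project` and `antipodal_pair` and the sample point are assumptions introduced here.

```python
import math

def sphere_project(point, l, m):
    """Image of `point` under p_{l,m}: normalise onto the unit sphere,
    then project from (0, 0, l) onto the plane z = -m."""
    x, y, z = point
    r = math.sqrt(x * x + y * y + z * z)
    denom = l * r - z
    if abs(denom) < 1e-12:
        raise ValueError("point is projected to infinity")
    scale = (l + m) / denom
    return (scale * x, scale * y)

def antipodal_pair(point, l, m):
    """Images of both members of the antipodal point pair of `point`."""
    x, y, z = point
    return sphere_project(point, l, m), sphere_project((-x, -y, -z), l, m)

# Hypothetical sample point, chosen only for illustration.
P = (1.0, 2.0, 3.0)
stereo = sphere_project(P, l=1.0, m=0.0)  # projection from the north pole
persp = sphere_project(P, l=0.0, m=1.0)   # projection from the sphere center
print(stereo, persp)
```

With l = 0 both members of the antipodal pair land on the same image point, which is why an ordinary perspective camera sees a single image of each space point.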



2.2 Catadioptric Projection 

In this section we will describe the projections using conic section mirrors.
Throughout the section we will refer to Figures 2 and 3.

Fig. 1. A point P = (x, y, z, w) is projected via s to two antipodal points
(±x, ±y, ±z)/r on the sphere. The two antipodal points are projected to the image
plane z = -m via projection from the point (0, 0, l).

Parabolic Mirror. We call the projection induced by a parabolic mirror to 
an image plane a parabolic projection. The parabolic projection of a point P in 
space is the orthographic projection of the intersection of the line PF (where F 
is the parabola’s focus) and the parabola. The orthographic projection is to the 
image plane perpendicular to the axis of the parabola. Any line (in particular 
a ray of light) incident with the focus is reflected such that it is perpendicular 
to the image plane, and (ideally) these are the only rays that the orthographic 
camera receives. 

The projection described is equivalent to central projection of a point to the 
parabola, followed by standard orthographic projection. Thus we proceed in a 
similar fashion as we did for the sphere. Assume that a parabola is placed such 
that its axis is the z-axis, its focus is located at the origin, and p is its focal 
length. Then

$$ S = \left\{ (x,y,z) \;\middle|\; \frac{x^2+y^2}{4p} - p = z \right\} $$

is the surface of the parabola. Now define $\sim$ such that if $P, Q \in S$, then $P \sim Q$
if and only if there exists a $\lambda \in \mathbb{R}$ such that $P = \lambda Q$. We now determine the
projection $s_p : \mathbb{P}^3(\mathbb{R}) \to S/\!\sim$,

$$ s_p(x,y,z,w) = \left( \pm\frac{2px}{r \mp z},\; \pm\frac{2py}{r \mp z},\; \pm\frac{2pz}{r \mp z} \right), $$

where $r = \sqrt{x^2+y^2+z^2}$. The next step is to project orthographically to the
plane z = 0 (the actual distance of the plane to the origin is of course
inconsequential). We thereby obtain $q^* : \mathbb{P}^3(\mathbb{R}) \to \mathbb{R}^2/\!\sim$ given by

$$ q^*(x,y,z,w) = \left( \pm\frac{2px}{r \mp z},\; \pm\frac{2py}{r \mp z} \right). $$

Again, $\mathbb{R}^2/\!\sim$ is $\mathbb{R}^2$ with the equivalence relation carried over by orthographic
projection of the parabola.

Fig. 2. Cross-section of a parabolic mirror. The image plane is through the focal point.
The point in space P is projected to the antipodal points P' and P'', which are then
orthographically projected to Q' and Q'' respectively.

Fig. 3. Cross-section of a hyperbolic mirror; again the image plane is through the
focal point. The point in space P is projected to the antipodal points P' and P'',
which are then perspectively projected to Q' and Q'' from the second focal point.

Remark. Note that 

$$ q^*(x,y,z,w) = 2p\, p^*_{1,0}(x,y,z,w) = p^*_{1,\,2p-1}(x,y,z,w). $$
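The relation in this remark can be checked numerically by carrying out the parabolic projection literally. The sketch below assumes the paraboloid z = (x^2 + y^2)/(4p) - p (focus at the origin, focal length p, opening toward +z), intersects it with the line through the focus and the space point, and drops the z coordinate. The function names and the sample values are hypothetical.

```python
import math

def parabolic_project(point, p):
    """Orthographic images of the two intersections of the line through the
    focus (origin) and `point` with z = (x^2 + y^2)/(4p) - p."""
    x, y, z = point
    rho2 = x * x + y * y
    if rho2 == 0.0:
        raise ValueError("point lies on the mirror axis")
    # The point t*(x, y, z) is on the paraboloid when
    # t^2 * rho2 / (4p) - t * z - p = 0.
    root = math.sqrt(z * z + rho2)
    t_plus = 2.0 * p * (z + root) / rho2
    t_minus = 2.0 * p * (z - root) / rho2
    return (t_plus * x, t_plus * y), (t_minus * x, t_minus * y)

def stereographic(point):
    """Stereographic image (l = 1, m = 0) of the normalised point."""
    x, y, z = point
    r = math.sqrt(x * x + y * y + z * z)
    return (x / (r - z), y / (r - z))

# Hypothetical sample values, for illustration only.
P = (1.0, 2.0, 3.0)
p = 0.7
q_first, q_second = parabolic_project(P, p)
s_first = stereographic(P)
s_second = stereographic((-1.0, -2.0, -3.0))
print(q_first, q_second)
```

The two orthographic images coincide with 2p times the stereographic images of the point and of its antipode, which is the content of the remark.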



Hyperbolic and Elliptical Mirrors. As with the paraboloid, hyperbolic
projection is the result of reflecting rays off a hyperbolic mirror. Rays incident
with one of the focal points of the hyperbola are reflected into rays incident with
the second focal point. To obtain the projection of a point, intersect the line
containing the point and the focal point with the hyperbola. Take the two
intersection points and project them to the image plane. The same applies to
ellipses.

Assume a hyperbola is placed such that its axis is the z-axis, one of its foci
is the origin, the other (0, 0, -d), and its latus rectum is 4p. Then the surface of
the hyperbola is

$$ S = \left\{ (x,y,z) \;\middle|\; \frac{(z + d/2)^2}{a^2} - \frac{x^2+y^2}{b^2} = 1 \right\}, $$

where

$$ a = \frac{1}{2}\left( \sqrt{d^2+4p^2} - 2p \right), \quad\text{and}\quad b = \sqrt{p\left( \sqrt{d^2+4p^2} - 2p \right)}. $$

Let $\sim$ be similarly defined for points of S, identifying antipodal points of the
hyperbola's surface with respect to the focus. The projection $s_{p,d} :
\mathbb{P}^3(\mathbb{R}) \to S/\!\sim$ may be obtained by intersecting the line through the point and
the origin with the hyperbola; however, it is too long to include here. Nevertheless,
once obtained we proceed by applying a perspective projection of the
antipodal point pair given by $s_{p,d}(x,y,z,w)$ from the point (0, 0, -d) to the plane
z = 0, calling this projection $r^*_{p,d} : \mathbb{P}^3(\mathbb{R}) \to \mathbb{R}^2/\!\sim$. We find that

$$ r^*_{p,d}(x,y,z,w) = \left( \pm\frac{2xdp/\sqrt{d^2+4p^2}}{\frac{d}{\sqrt{d^2+4p^2}}\, r \mp z},\; \pm\frac{2ydp/\sqrt{d^2+4p^2}}{\frac{d}{\sqrt{d^2+4p^2}}\, r \mp z} \right), $$

where $r = \sqrt{x^2+y^2+z^2}$.



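Remark. Notice that

$$ r^*_{p,d}(x,y,z,w) = p^*_{\frac{d}{\sqrt{d^2+4p^2}},\; \frac{d(2p-1)}{\sqrt{d^2+4p^2}}}(x,y,z,w), $$

that is, $l = d/\sqrt{d^2+4p^2}$ and $l + m = 2dp/\sqrt{d^2+4p^2}$.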

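For an ellipsoid similarly placed so that its foci are (0, 0, 0) and (0, 0, -d),
and with latus rectum 4p, we have

$$ S = \left\{ (x,y,z) \;\middle|\; \frac{(z + d/2)^2}{a^2} + \frac{x^2+y^2}{b^2} = 1 \right\}, $$

where

$$ a = \frac{1}{2}\left( \sqrt{d^2+4p^2} + 2p \right), \quad\text{and}\quad b = \sqrt{p\left( \sqrt{d^2+4p^2} + 2p \right)}. $$

We then derive that $t^*_{p,d} : \mathbb{P}^3(\mathbb{R}) \to \mathbb{R}^2/\!\sim$ is given by

$$ t^*_{p,d}(x,y,z,w) = \left( \pm\frac{2xdp}{dr \pm z\sqrt{d^2+4p^2}},\; \pm\frac{2ydp}{dr \pm z\sqrt{d^2+4p^2}} \right). $$

Remark. We have that $t^*_{p,d}$ satisfies

$$ t^*_{p,d}(x,y,z,w) = r^*_{p,d}(x,y,-z,w) = p^*_{\frac{d}{\sqrt{d^2+4p^2}},\; \frac{d(2p-1)}{\sqrt{d^2+4p^2}}}(x,y,-z,w). $$

Thus the ellipse gives the same projection as the hyperbola, modulo a reflection
about z = 0.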
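The hyperbolic and elliptical cases can be checked in the same spirit. The sketch below builds the mirror point from the polar equation of a conic about its focus (semi-latus rectum 2p), projects it from (0, 0, -d) onto z = 0, and compares with a sphere projection. Everything named here is an assumption of the sketch, in particular the value of m, which is derived by matching the factor l + m of the sphere projection with 2dp/sqrt(d^2 + 4p^2).

```python
import math

def sphere_project(point, l, m):
    """p_{l,m}: normalise onto the unit sphere, project from (0,0,l) to z = -m."""
    x, y, z = point
    r = math.sqrt(x * x + y * y + z * z)
    k = (l + m) / (l * r - z)
    return (k * x, k * y)

def equivalent_sphere_params(p, d):
    """(l, m) of the sphere model; m is chosen so that l + m = 2dp/s."""
    s = math.sqrt(d * d + 4.0 * p * p)
    return d / s, d * (2.0 * p - 1.0) / s

def mirror_project(point, p, d, kind):
    """Reflect via a hyperbolic or elliptical mirror with foci (0,0,0) and
    (0,0,-d), then project from (0,0,-d) onto the plane z = 0.
    Returns the image point and the mirror point."""
    x, y, z = point
    r = math.sqrt(x * x + y * y + z * z)
    s = math.sqrt(d * d + 4.0 * p * p)
    if kind == "hyperbolic":
        ecc = 0.5 * d / (0.5 * (s - 2.0 * p))
        rho = 2.0 * p / (1.0 - ecc * z / r)
    elif kind == "elliptical":
        ecc = 0.5 * d / (0.5 * (s + 2.0 * p))
        rho = 2.0 * p / (1.0 + ecc * z / r)
    else:
        raise ValueError("kind must be 'hyperbolic' or 'elliptical'")
    X, Y, Z = rho * x / r, rho * y / r, rho * z / r
    k = d / (Z + d)
    return (k * X, k * Y), (X, Y, Z)

# Hypothetical mirror parameters and sample point.
p, d = 0.8, 2.0
P = (1.0, 2.0, -1.0)
l, m = equivalent_sphere_params(p, d)
hyp_img, hyp_pt = mirror_project(P, p, d, "hyperbolic")
ell_img, ell_pt = mirror_project(P, p, d, "elliptical")
print(hyp_img, sphere_project(P, l, m))
print(ell_img, sphere_project((P[0], P[1], -P[2]), l, m))
```

For the elliptical mirror the comparison uses the point reflected about z = 0, matching the remark above.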




2.3 Equivalence of Catadioptric and Spherical Projections 

From the discussion above we may write a general theorem which will allow us 
to more generally develop the theory of catadioptric image geometry. We have 
the following central theorem. 

Theorem 1 (Projective Equivalence). Catadioptric projection with a single 
effective viewpoint is equivalent to projection to a sphere followed by projection 
to a plane from a point. 

Proof. In the past two sections we have the following relationships for the
projection functions:

$$ \begin{aligned}
r^*_{p,d}(x,y,z,w) &= p^*_{\frac{d}{\sqrt{d^2+4p^2}},\; \frac{d(2p-1)}{\sqrt{d^2+4p^2}}}(x,y,z,w) && \text{(hyperbola} \leftrightarrow \text{sphere)}, \\
t^*_{p,d}(x,y,z,w) &= p^*_{\frac{d}{\sqrt{d^2+4p^2}},\; \frac{d(2p-1)}{\sqrt{d^2+4p^2}}}(x,y,-z,w) && \text{(ellipse} \leftrightarrow \text{sphere)}, \\
q^*(x,y,z,w) &= p^*_{1,\,2p-1}(x,y,z,w) && \text{(parabola} \leftrightarrow \text{sphere)},
\end{aligned} $$

and perspective projection with focal length f is $p^*_{0,f}(x,y,z,w)$ (perspective
$\leftrightarrow$ sphere). Each is a map from $\mathbb{P}^3(\mathbb{R})$ to $\mathbb{R}^2$, and for any point in space the relations show
that they map to the same point in the image plane. □

We now have a unified theory of catadioptric projection, and in further
discussion we need only consider projections of the sphere. In the interest of conciseness
we wish to give a name to this class of projections. We write $\pi_{l,m}$ to represent
the projective plane induced by the projection $p^*_{l,m}$. Recall that if l = 1 then we
have the projective plane obtained from stereographic projection, or equivalently
parabolic projection. If l = 0 then we have the projective plane obtained from
perspective projection.

Having demonstrated the equivalence with the sphere we now wish to describe
in more detail the structure of the projective plane $\pi_{l,m}$. We therefore describe
the images of lines under these projections, that is, the “lines” of the projective
planes. But because of the equivalence with the sphere, we may restrict ourselves
to studying the projections of great circles and antipodal points. Thus let
$p^*_{l,m} : S^2/\!\sim\; \to \mathbb{R}^2/\!\sim$ project points of the sphere to the image plane. Figure 4 shows
the projection of a great circle to the image plane. The equator is projected to
a circle of radius $(l+m)/l$; this is the horizon of the fronto-parallel plane, since the equator
is the projection of the line at infinity in the plane z = 0. The proposition below
describes the family of conics which are images of lines.

Proposition 1. The image of a line is a conic which intersects the fronto-parallel
horizon antipodally and whose major axis (when it exists) passes through the
image center.

Proof. Note first that a great circle (which is itself the image
of a line in space) intersects the equator in two points which are antipodal. Their
projection to the image plane gives two points which again are antipodal on the 
image of the equator. The image of the great circle must be symmetric about 
the axis made by the perpendicular bisector of the two intersection points. This 
axis contains the image center since the midpoint of the intersection points is 
the image center. 

The actual image may be obtained by taking a cone whose vertex is the point 
of projection (0, 0, l) and which contains the great circle, then intersecting the
cone with the image plane. The intersection of a plane and a skew cone is still a 
conic. □ 




Fig. 4. A line in space is projected to a great circle on the sphere, which is then 
projected to a conic section in the plane via $p^*_{l,m}$. The equator is mapped to the fronto-
parallel horizon, the dotted circle in the plane.



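Note that if a conic has the properties in the proposition it is not necessarily
the image of a line. There is an additional constraint on the foci of the conic.
Let us therefore determine the image of a great circle. Let $\hat{n} = (n_x, n_y, n_z)$ be
the normal of a plane containing some great circle. To obtain the quadratic form
of the conic section we find the quadratic form of the cone through $(0, 0, l)$ and
the great circle. To do this we first rotate to a coordinate system $(x', y', z')$ such
that the great circle lies in the plane $z' = 0$. Then the vertex of the cone is
$(l\sqrt{1 - n_z^2},\, 0,\, l n_z)$. Points of the cone, in the rotated coordinate system, then
satisfy

$$ \begin{pmatrix} x' & y' & z' & 1 \end{pmatrix}
\begin{pmatrix}
1 & 0 & -\frac{\sqrt{1-n_z^2}}{n_z} & 0 \\
0 & 1 & 0 & 0 \\
-\frac{\sqrt{1-n_z^2}}{n_z} & 0 & \frac{l^2(1-n_z^2) - 1}{l^2 n_z^2} & \frac{1}{l n_z} \\
0 & 0 & \frac{1}{l n_z} & -1
\end{pmatrix}
\begin{pmatrix} x' \\ y' \\ z' \\ 1 \end{pmatrix} = 0 . $$

By rotating back to the original coordinate system and intersecting with the
image plane we have, for image points (x, y), the conic $(x\ y\ 1)\, C_{\hat n}\, (x\ y\ 1)^T = 0$ with

$$ C_{\hat n} = \begin{pmatrix}
-n_x^2\alpha - l^2(n_x^2 + n_y^2 n_z^2) & (l^2 - 1)\, n_x n_y \alpha & (l+m)\, n_x n_z \alpha \\
(l^2 - 1)\, n_x n_y \alpha & -n_y^2\alpha - l^2(n_y^2 + n_x^2 n_z^2) & (l+m)\, n_y n_z \alpha \\
(l+m)\, n_x n_z \alpha & (l+m)\, n_y n_z \alpha & -(l+m)^2 n_z^2 \alpha
\end{pmatrix}, $$

where $\alpha = n_z^2 - 1 = -(n_x^2 + n_y^2)$. Let $C_{\hat n}$ be the matrix above. From this form we
may extract the axes, center, eccentricity and foci, finding that

$$ \begin{aligned}
c &= \left( \frac{(l+m)\, n_x n_z}{n_x^2+n_y^2-l^2},\; \frac{(l+m)\, n_y n_z}{n_x^2+n_y^2-l^2} \right) && \text{(center)}, \\
f_\pm &= \left( \frac{(l+m)\, n_x \left( n_z \pm \sqrt{1-l^2} \right)}{n_x^2+n_y^2-l^2},\; \frac{(l+m)\, n_y \left( n_z \pm \sqrt{1-l^2} \right)}{n_x^2+n_y^2-l^2} \right) && \text{(foci)}, \\
a &= \frac{l\,(l+m)\, |n_z|}{\left| n_x^2+n_y^2-l^2 \right|} && \text{(major semi-axis)}, \\
b &= \frac{l+m}{\sqrt{\left| n_x^2+n_y^2-l^2 \right|}} && \text{(minor semi-axis)}, \\
e &= \frac{\sqrt{(1-l^2)(n_x^2+n_y^2)}}{l\, |n_z|} && \text{(eccentricity)}.
\end{aligned} $$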
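A numerical sanity check of the conic parameters is sketched below. It evaluates closed-form expressions for the center, foci and semi-axes of a line image (written with the signed n_z so that either orientation of the normal works, and re-derived for this sketch under the projection p_{l,m}), then samples the great circle and tests the focal property of an ellipse. It assumes 0 < l < 1, n_z != 0 and the elliptical case n_x^2 + n_y^2 < l^2; all names and sample values are hypothetical.

```python
import math

def sphere_project(unit_point, l, m):
    """p_{l,m} restricted to points already on the unit sphere."""
    x, y, z = unit_point
    k = (l + m) / (l - z)
    return (k * x, k * y)

def line_image_conic(n, l, m):
    """Center, foci and semi-axes of the image of the great circle with
    unit normal n (assumes 0 < l < 1 and n_z != 0)."""
    nx, ny, nz = n
    big_l = l + m
    s = math.sqrt(1.0 - l * l)
    den = nx * nx + ny * ny - l * l
    if abs(den) < 1e-12:
        raise ValueError("degenerate case: the line image is a parabola")
    center = (big_l * nx * nz / den, big_l * ny * nz / den)
    f_plus = (big_l * nx * (nz + s) / den, big_l * ny * (nz + s) / den)
    f_minus = (big_l * nx * (nz - s) / den, big_l * ny * (nz - s) / den)
    a = l * big_l * abs(nz) / abs(den)
    b = big_l / math.sqrt(abs(den))
    ecc = math.sqrt((1.0 - l * l) * (nx * nx + ny * ny)) / (l * abs(nz))
    return center, f_plus, f_minus, a, b, ecc

# Hypothetical example: l = 0.8, m = 0.2 and a plane normal (0.6, 0, 0.8).
l, m = 0.8, 0.2
n = (0.6, 0.0, 0.8)
center, f_plus, f_minus, a, b, ecc = line_image_conic(n, l, m)

# Two unit vectors spanning the plane perpendicular to n.
u1 = (0.0, 1.0, 0.0)
u2 = (0.8, 0.0, -0.6)
focal_sums = []
for k in range(12):
    t = 2.0 * math.pi * k / 12.0
    pt = tuple(math.cos(t) * u1[i] + math.sin(t) * u2[i] for i in range(3))
    q = sphere_project(pt, l, m)
    focal_sums.append(math.dist(q, f_plus) + math.dist(q, f_minus))
print(a, b, ecc, focal_sums[0])
```

Every sampled image point has the same sum of distances to the two foci, namely 2a, and the center is the midpoint of the foci on the line through the image center.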



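Meet and join. We find that the set of “points” of the projective plane $\pi_{l,m}$ is

$$ \Pi(\pi_{l,m}) = \left\{ \left( \frac{(l+m)\, m_x}{l - m_z},\; \frac{(l+m)\, m_y}{l - m_z} \right) \;\middle|\; \hat{m} \in S^2 \right\}. $$

A line is the set of points

$$ [\hat{n}] = \left\{ (x, y) \;\middle|\; (x\ y\ 1)\, C_{\hat n}\, (x\ y\ 1)^T = 0 \right\}. $$

Thus the set of “lines” of the projective plane is

$$ \Lambda(\pi_{l,m}) = \left\{ [\hat{n}] \;\middle|\; \hat{n} \in S^2 \right\}. $$

We may then define the operator meet $\wedge : \Lambda(\pi_{l,m}) \times \Lambda(\pi_{l,m}) \to \Pi(\pi_{l,m})$ to take a
pair of lines to their intersection, and the operator join $\vee : \Pi(\pi_{l,m}) \times \Pi(\pi_{l,m}) \to
\Lambda(\pi_{l,m})$ to take a pair of points to the line through them.

2.4 Duality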

In this section we will show that two projections of the sphere are dual to each 
other. The antipodal point pairs of one projection are the foci of line images 
in another projection, and vice versa. When the projection is stereographic (i.e. 
parabolic) the dual is the usual perspective projection. 

We have seen that images of lines are conics. We would like to know if there
is anything special about families of line images which intersect in the same point.
A set of longitudes on the sphere all intersect in two antipodal points; what are
their projections? It is clear that the images must all intersect in two points
since incidence relationships are preserved, but is there anything special about
this particular pencil of conics?

Proposition 2. The locus of foci of a set of line images, where the great circles 
corresponding to the lines intersect antipodally, is a conic whose foci are the 
images of the intersection points. 

Proof. Assume l and m are constant. Choose some point $\hat{m} = (m_x, m_y, m_z)$ on
the sphere; by rotational symmetry we may assume without loss of generality
that $m_y = 0$. The normals of all lines perpendicular to $\hat{m}$, i.e. those which
intersect in $\hat{m}$, are

$$ (n_x^\theta, n_y^\theta, n_z^\theta) = (m_z \sin\theta,\; \cos\theta,\; -m_x \sin\theta) . $$

Substituting into the formula found for the first focus, we have

$$ f_+^\theta = \left( \frac{(l+m)\, n_x^\theta \left( n_z^\theta + \sqrt{1-l^2} \right)}{(n_x^\theta)^2 + (n_y^\theta)^2 - l^2},\; \frac{(l+m)\, n_y^\theta \left( n_z^\theta + \sqrt{1-l^2} \right)}{(n_x^\theta)^2 + (n_y^\theta)^2 - l^2} \right)
= \left( \frac{(l+m)\, m_z \sin\theta}{\sqrt{1-l^2} + m_x \sin\theta},\; \frac{(l+m) \cos\theta}{\sqrt{1-l^2} + m_x \sin\theta} \right). $$

But this is just the projection of $(n_x^\theta, n_y^\theta, n_z^\theta)$ by $p^*_{\sqrt{1-l^2},\; l+m-\sqrt{1-l^2}}$.

So let $l' = \sqrt{1-l^2}$ and $m' = l + m - \sqrt{1-l^2}$. Under the projection $p^*_{l',m'}$
the image of the great circle perpendicular to $(m_x, m_y, m_z)$, i.e. the points
$\{(n_x^\theta, n_y^\theta, n_z^\theta)\}$, is once again a conic. One of its foci is

$$ \left( \frac{(l'+m')\, m_x \left( m_z + \sqrt{1-l'^2} \right)}{m_x^2 - l'^2},\; 0 \right) = \left( \frac{(l+m)\, m_x}{l - m_z},\; 0 \right). $$

This is the image of $(m_x, 0, m_z)$ under $p^*_{l,m}$.

Define the map $f_{l,m}$ such that given the normal of a line it produces the foci
of the line’s image. Note that this map is injective and therefore its inverse is
well defined. We have the following theorem on duality.

Fig. 5. The two ellipses are projections of two lines in space. Their foci $F_1, F_2$ and
$G_1, G_2$ respectively lie on a hyperbola containing the foci of all ellipses through U
and V. The foci of this hyperbola are the points U and V.



Theorem 2 (Duality). Given the two projective planes $\pi_1 = \pi_{l,m}$ and $\pi_2 =
\pi_{l',m'}$, where $l$, $m$, $l'$ and $m'$ satisfy

$$ l^2 + l'^2 = 1 \quad\text{and}\quad l + m = l' + m' , $$

the following is true:

$$ \begin{aligned}
f_{l,m} &: \Lambda(\pi_1) \to \Pi(\pi_2), \\
f_{l,m}^{-1} &: \Pi(\pi_2) \to \Lambda(\pi_1), \\
f_{l',m'} &: \Lambda(\pi_2) \to \Pi(\pi_1), \\
f_{l',m'}^{-1} &: \Pi(\pi_1) \to \Lambda(\pi_2) .
\end{aligned} $$

In fact the two projective planes $\pi_1$ and $\pi_2$ are dual. If $\ell_1, \ell_2$ are lines of $\pi_1$ and
$p_1, p_2$ are points of $\pi_1$, then:

$$ \begin{aligned}
f_{l',m'}^{-1}(\ell_1 \wedge \ell_2) &= f_{l,m}(\ell_1) \vee f_{l,m}(\ell_2), \\
f_{l,m}(p_1 \vee p_2) &= f_{l',m'}^{-1}(p_1) \wedge f_{l',m'}^{-1}(p_2) .
\end{aligned} $$

Proof. The preceding proposition showed that the foci of a pencil of lines $\{\ell_\lambda\}$
lie on a conic c, where $c \in \Lambda(\pi_2)$. The foci of c were the two points of intersection
of the pencil of lines, so $f_{l',m'}(c) = \ell_{\lambda_1} \wedge \ell_{\lambda_2}$. But $c = f_{l,m}(\ell_{\lambda_1}) \vee f_{l,m}(\ell_{\lambda_2})$, and
so

$$ f_{l',m'}^{-1}(\ell_{\lambda_1} \wedge \ell_{\lambda_2}) = f_{l,m}(\ell_{\lambda_1}) \vee f_{l,m}(\ell_{\lambda_2}) . $$

The second is true because the dual of the proposition also holds, namely that a set
of collinear points (in $\pi_1$) produce a line whose foci are a single point of $\pi_2$. □

Corollary 1. The projective planes $\pi_{1,0}$ and $\pi_{0,1}$ are dual. The first is obtained
from stereographic projection, and the second from perspective projection. The
center of every circle in a parabolic projection is a point in the perspective
projection; every point of a parabolic projection is a focal point of a line in a perspective
projection.
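The duality can be illustrated numerically. The sketch below compares the first focus of a line image under p_{l,m} (closed form, written with the signed n_z) with the image of the line's normal under the dual projection with l' = sqrt(1 - l^2) and m' = l + m - l'. It then checks the parabolic case: centers of circles through a common antipodal pair are collinear, lying on the perspective image of the great circle perpendicular to the common point. Function names and sample values are assumptions of this sketch.

```python
import math

def sphere_project(unit_point, l, m):
    """p_{l,m} restricted to points already on the unit sphere."""
    x, y, z = unit_point
    k = (l + m) / (l - z)
    return (k * x, k * y)

def first_focus(n, l, m):
    """First focus of the image of the great circle with unit normal n."""
    nx, ny, nz = n
    s = math.sqrt(1.0 - l * l)
    den = nx * nx + ny * ny - l * l
    k = (l + m) * (nz + s) / den
    return (k * nx, k * ny)

def dual_params(l, m):
    """Parameters (l', m') with l^2 + l'^2 = 1 and l + m = l' + m'."""
    l_dual = math.sqrt(1.0 - l * l)
    return l_dual, l + m - l_dual

# General case with hypothetical values l = 0.8, m = 0.2.
l, m = 0.8, 0.2
n = (2.0 / 3.0, 1.0 / 3.0, 2.0 / 3.0)
focus = first_focus(n, l, m)
dual_image = sphere_project(n, *dual_params(l, m))

# Parabolic case (l = 1, m = 0): lines through the point pair +/- m_hat.
m_hat = (0.6, 0.0, 0.8)
centers = []
for theta in (0.3, 1.0, 2.0):
    normal = (m_hat[2] * math.sin(theta), math.cos(theta),
              -m_hat[0] * math.sin(theta))
    centers.append(first_focus(normal, 1.0, 0.0))  # both foci coincide here
print(focus, dual_image, centers)
```

In the parabolic case each center (u, v) satisfies m_x u + m_y v = m_z, the equation of a straight line in the perspective plane.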

3 Advantages of Catadioptric Projection 

The presented unifying theory of catadioptric projections enables a direct and
natural insight into the invariances of these projections. The perspective projection
is a degenerate case of a catadioptric projection, a fact which, as we will show,
directly reveals its inferiority to the other catadioptric projections (parabolic
and hyperbolic).



3.1 Recovery of Geometric Properties 

We have shown that parabolic projection is equivalent to stereographic projec- 
tion, as well as being dual to perspective projection. Stereographic projection is 
a map with several important properties. First the projection of any circle on 
the sphere is a circle in the plane. In particular the projection of a great circle is 
a circle. What is also important is that the map is conformal. The angle between 
two great circles (i.e. the inverse cosine of the dot product of the normals of their 
planes) is the same angle between the circles which are their projections. This 
is important because it means for one thing that if two circles are horizons of 
some planes, and they are orthogonal, then the planes are perpendicular. 

Corollary 2. The angle between great circles on the sphere is equal to the angle 
between their projections. 

Proof of this fact is given in almost any book on geometry, and is
a direct result of the fact that stereographic projection is a conformal mapping.
This implies that the angle between the horizons of two planes is equal to the
angle between the two planes; orthogonal planes have orthogonal horizons.
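The angle-preserving property can be checked with a few lines of arithmetic. In the sketch below, the stereographic image (l = 1, m = 0) of a great circle with unit normal n is taken to be the circle with center (-n_x/n_z, -n_y/n_z) and radius 1/|n_z|, and the angle between two image circles is obtained from the law of cosines in the triangle formed by the two centers and an intersection point. The function names and the sample normals are hypothetical.

```python
import math

def line_image_circle(n):
    """Stereographic image of the great circle with unit normal n."""
    nx, ny, nz = n
    if nz == 0.0:
        raise ValueError("a great circle through the pole maps to a straight line")
    return (-nx / nz, -ny / nz), 1.0 / abs(nz)

def circle_angle_cos(c1, r1, c2, r2):
    """Cosine of the angle between two intersecting circles, measured between
    the radii drawn to a common point (law of cosines)."""
    dist2 = (c1[0] - c2[0]) ** 2 + (c1[1] - c2[1]) ** 2
    return (r1 * r1 + r2 * r2 - dist2) / (2.0 * r1 * r2)

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Hypothetical plane normals (unit vectors).
n1 = (0.6, 0.0, 0.8)
n2 = (0.0, 0.6, 0.8)
n3 = (0.8, 0.0, -0.6)   # perpendicular to n1

cos_12 = circle_angle_cos(*line_image_circle(n1), *line_image_circle(n2))
cos_13 = circle_angle_cos(*line_image_circle(n1), *line_image_circle(n3))
print(cos_12, dot(n1, n2))
print(cos_13, dot(n1, n3))
```

Absolute values are compared in practice because a normal and its negation describe the same plane, so the angle is only defined up to its supplement.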



3.2 Calibration 

Almost all applications in computer vision require that the imaging sensor’s 
intrinsic parameters be calibrated. The intrinsic parameters include focal length, 
image center and aspect ratio, as well as any other parameters which determine 
the projection induced by the sensor such as radial distortion. Sometimes it 
is possible to calibrate one or more of those parameters with minimal prior 
information about scene geometry or configuration. For example, it has been 
shown that radial distortion can be calibrated using only the images of lines.
The only assumption is that points have been gathered in the image which are
projections of points in space lying on some straight line. Using this information
not only is it possible to determine the radial distortion parameters, but the
image center also may be obtained.

We have shown prior to this work that it is possible to calibrate all of the 
intrinsic parameters of a parabolic catadioptric sensor, again using only lines. 
Let us gain some intuition as to why this is true, and why it is not possible to 
calibrate a normal perspective camera with these simple assumptions. 

First examine the perspective case. Assuming that the aspect ratio is one, the
intrinsic parameters are the image center and the focal length, three unknowns in
total. The image of a line in space is a line in the image plane, and any given line may be
uniquely determined by two points, giving two constraints. From any image line it is possible only to
determine the orientation of the plane containing the line in space and the focal
point; the orientation of this plane can be parameterized by two parameters.
Given n lines, how many constraints are there and how many unknowns? If for
some n the number of constraints exceeds the number of unknowns, then we
have a hope of obtaining the unknowns, and thus of calibrating the sensor. However,
for every line added we gain two more constraints and two more unknowns; we
will always be short by three equations. Therefore self-calibration from lines,
without any metric information, and in one frame is hopeless in the perspective
case.

What about the parabolic case? There are a total of three unknowns: the focal
length and the image center (the latter alone giving two unknowns). The projection of any line
is a circle, which is completely specified by as few as three points, therefore giving
three constraints. The orientation of the plane containing the line gives two
unknowns. So, every line that we obtain adds one more constraint than it adds
unknowns. If there are three lines, we have 9 constraints and 9 unknowns, and thus
we can perform self-calibration with only three lines.

Finally the hyperbolic case. There are four unknowns and each line adds two 
for orientation. The projection of a line is a conic which may be specified by five 
points. Thus when we have two lines we have 8 unknowns and 10 constraints. 
So, with only two lines the system is over-determined, but nevertheless we can 
still perform a calibration. 
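The counting argument for the three cases can be written out as a small tally. This is only bookkeeping under the numbers stated in the text (intrinsic unknowns, constraints per line image, and two orientation unknowns per line); the function names are hypothetical.

```python
def count(intrinsic_unknowns, constraints_per_line, n_lines, unknowns_per_line=2):
    """Return (unknowns, constraints) for n_lines observed line images."""
    unknowns = intrinsic_unknowns + unknowns_per_line * n_lines
    constraints = constraints_per_line * n_lines
    return unknowns, constraints

def min_lines(intrinsic_unknowns, constraints_per_line, max_lines=100):
    """Smallest number of lines with at least as many constraints as
    unknowns, or None if no such number exists up to max_lines."""
    for n in range(1, max_lines + 1):
        unknowns, constraints = count(intrinsic_unknowns, constraints_per_line, n)
        if constraints >= unknowns:
            return n
    return None

perspective = min_lines(intrinsic_unknowns=3, constraints_per_line=2)
parabolic = min_lines(intrinsic_unknowns=3, constraints_per_line=3)
hyperbolic = min_lines(intrinsic_unknowns=4, constraints_per_line=5)
print(perspective, parabolic, hyperbolic)
```

The tally only counts equations; it says nothing about whether a particular configuration of lines is degenerate.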

We give here a simple and compact algorithm for calibrating the parabolic
projection. It is based on the fact that a sphere whose equator is a line image
in the image plane contains the point $(c_x, c_y, 2p)$, where $(c_x, c_y)$ is
the image center, initially unknown. This follows by symmetry, since the
image circle intersects the fronto-parallel horizon at antipodal points a distance 2p from the
image center. Thus the intersection of at least three spheres so constructed will
produce the points $(c_x, c_y, \pm 2p)$, giving us both image center and focal length
simultaneously.

In the presence of noise, the intersection will not be defined for more than
three spheres, yet we may minimize the distance from a point to all of the spheres,
i.e. find the point $(c_x, c_y, 2p)$ such that

$$ \sum_{i=1}^{n} \left( (d_x^i - c_x)^2 + (d_y^i - c_y)^2 + 4p^2 - r_i^2 \right)^2 \qquad (1) $$

is a minimum over all points. Here $(d_x^i, d_y^i)$ is the center of the i-th image circle,
and $r_i$ is its radius. The intersection is not defined for fewer than three spheres,
since the intersection of two spheres gives only the circle within which the point
lies, but not the point itself.
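One possible way to carry out the minimisation of (1) is sketched below; it is not claimed to be the procedure used by the authors. Expanding the i-th residual gives |d_i|^2 - 2 d_x^i c_x - 2 d_y^i c_y + k - r_i^2 with k = c_x^2 + c_y^2 + 4p^2, which is linear in (c_x, c_y, k), so the sketch solves the normal equations of that linear least-squares problem and recovers p from k. The function names and the synthetic circles are assumptions made for illustration.

```python
import math

def solve3(matrix, rhs):
    """Solve a 3x3 linear system by Gaussian elimination with pivoting."""
    aug = [row[:] + [b] for row, b in zip(matrix, rhs)]
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("degenerate configuration (collinear circle centers?)")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, 3):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, 4):
                aug[r][c] -= factor * aug[col][c]
    sol = [0.0, 0.0, 0.0]
    for i in (2, 1, 0):
        tail = sum(aug[i][j] * sol[j] for j in range(i + 1, 3))
        sol[i] = (aug[i][3] - tail) / aug[i][i]
    return sol

def calibrate_parabolic(circles):
    """circles: list of (d_x, d_y, r) for the line images.
    Returns (c_x, c_y, p) minimising the sum of squared residuals in (1)."""
    ata = [[0.0] * 3 for _ in range(3)]
    atb = [0.0] * 3
    for dx, dy, r in circles:
        row = (-2.0 * dx, -2.0 * dy, 1.0)
        target = r * r - dx * dx - dy * dy
        for i in range(3):
            atb[i] += row[i] * target
            for j in range(3):
                ata[i][j] += row[i] * row[j]
    cx, cy, k = solve3(ata, atb)
    four_p2 = k - cx * cx - cy * cy
    if four_p2 <= 0.0:
        raise ValueError("no real focal length fits the given circles")
    return cx, cy, 0.5 * math.sqrt(four_p2)

# Synthetic, noise-free example: image center (320, 240), p = 100, and three
# plane normals; each circle has center c - 2p*(n_x, n_y)/n_z, radius 2p/|n_z|.
true_c, true_p = (320.0, 240.0), 100.0
normals = [(0.6, 0.0, 0.8), (0.0, 0.6, 0.8), (0.48, 0.64, 0.6)]
circles = [(true_c[0] - 2.0 * true_p * nx / nz,
            true_c[1] - 2.0 * true_p * ny / nz,
            2.0 * true_p / abs(nz)) for nx, ny, nz in normals]
cx, cy, p_est = calibrate_parabolic(circles)
print(cx, cy, p_est)
```

With exactly three circles the system is square and the spheres intersect exactly; with more circles and noisy data the same routine returns the least-squares point.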

4 Experiment 

We present here a short experiment with real data as a proof of concept. We will
show how, given a single catadioptric image of a plane (left image in Fig. 6), we can
recover the intrinsic parameters of the camera and also metrically rectify this plane.
The system used is an off-the-shelf realization (S1 model, Cyclovision Inc.) of a
parabolic catadioptric system invented by Nayar. The algorithm detects edge
points and groups them in elliptical arcs using a Delaunay triangulation of the
points and a subsequent Hough transform. An ellipse fitting algorithm is then
applied on the clustered points. The aspect ratio is eliminated and the ellipses
are transformed to circles (Fig. 6 middle). We additionally assume that these
lines are coplanar and that they belong to two groups of parallel lines. However,
we do not make any assumption about the angles between these lines.

From the parallelism assumption we know that the intersections of the circles 
are the antipodal projections of vanishing points. The calibration theory devel- 
oped above tells us that the intersection of the lines connecting the antipodal 
points gives the image center. 

Two vanishing points and the focal point define a plane parallel to the plane
viewed. Imagine the horizon of this plane (line at infinity) defined by the two
sets of vanishing points. Imagine also a pole on the sphere corresponding to the
plane spanned by the horizon and the focal point. The parabolic projection of the
horizon is a circle (the horizon conic) and its center is the projection of the pole.
However, this pole gives exactly the normal of the plane in which all the lines lie. This center
is the dual point to the line which is the horizon of the perspective projection.
Given this calculation, the focal length is directly obtained. This focal length
is the effective focal length required for any operation in the catadioptric system
(we cannot decouple the mirror from the lens focal length). We have thus been
able to compute image center, focal length, and the normal of a plane without
assuming any metric information. We visualize the result on the right of Fig. 6,
where we have rectified the ceiling plane so that it looks as if it were fronto-
parallel. Unlike the planar perspective case, metric rectification of a plane
from a single image is possible with a parabolic catadioptric system without any
metric information.

Fig. 6. Top left: Original image of the ceiling recorded by the catadioptric camera
slanted approx. 45 deg. with respect to the ceiling. Top right: Two groups of four and
three circles, respectively, fitted on the images of the ceiling edges. The lines through
the vanishing points intersect at the image center and all the vanishing points lie on
a circle. Bottom middle: Both the collinearity of the edge elements and the
perpendicularity of the edges show a superior performance in estimating intrinsics as well as
pan-tilt of the ceiling using only natural landmarks.

5 Conclusion

In this paper, we presented a novel theory on the geometry of central panoramic
or catadioptric vision systems. We proved that every projection can be modeled
with the projection of the sphere to a horizontal plane from a point on the
vertical axis of the sphere. This model includes traditional cameras, which
are equivalent to a catadioptric projection via a planar mirror. In this case the
projection point of our model is the center of the sphere. In the parabolic case,
the projection point becomes the north pole and the projection is stereographic.
The conformal mapping properties of stereographic projection show the power of
the parabolic systems. Hyperbolic or elliptical mirrors correspond to projections
from points on the vertical diameter within the sphere. We showed that
projections of points and lines in space are points and conics, respectively. Due to
preservation of the incidence relationship we can regard the conics as projective
lines. We showed that these projective lines are indeed dual to the points and
vice versa.

Very useful practical implications can be directly derived from this theory.
Calibration constraints are natural and we provided a geometric argument why
all catadioptric systems except the conventional planar projection can be
calibrated from one view. We gave experimental evidence using a parabolic
mirror, where we also showed that metric rectification of a plane is possible if
we have only affine but not metric information about the environment. We plan
to extend our theory to multiple catadioptric views as well as to the study of
robustness of scene recovery using the above principles.


Abstract. We consider the problem of aligning and calibrating a binocular
pan-tilt device using visual information from controlled motions,
while viewing a degenerate (planar) scene.

By considering the invariants to controlled motions about pan and elevation
axes while viewing the plane, we show how to construct the images
of points at infinity in various visual directions. First, we determine an
ideal point whose visual direction is orthogonal to the pan and tilt axes,
and use this point to align the rig to its own natural reference frame. Second,
we show how by combining stereo views we can construct further
points at infinity, and determine the left-right epipoles, without computing
the full epipolar geometry and/or projective structure. Third, we
show how to determine the infinite homography which maps ideal points
between left and right camera images, and hence solve for the two focal
lengths of the cameras. The minimum requirement is three views of the
plane, where the head undergoes one pan, and one elevation.

Results are presented using both simulated data, and real imagery
acquired from a 4 degree-of-freedom binocular rig.



1 Introduction 

It has been recognised for some time that self-calibration of a mobile camera is 
often possible from image information alone, and various algorithms have been 
presented in the literature, addressing different aspects of the problem [7IW 
EH. Almost all of these algorithms involve (either directly or indirectly)
computing any or all of the epipolar geometry, the camera locations and the 
projective structure of the scene viewed, and fail when this cannot be achieved. 
This occurs if the scene viewed is planar, for example. 

Recently, Triggs m has shown that five views of a planar (i.e. degenerate)
scene are, in principle, sufficient for self calibration. This is a notable result 
since in this case the scene structure is related between images by a 2D homographic
transformation, the fundamental matrix is under-determined, and
back-projection of image points to obtain projective structure is not possible.
The disadvantage of Triggs’ method is that it relies on a bundle adjustment 
procedure with no clear-cut initialisation. 

In this paper we are concerned with the same problem for a binocular rig 
viewing the planar scene. Clearly from Triggs’ result we could infer naively that 
five binocular views would be sufficient to determine the intrinsics of each ca- 
mera. However this approach neglects the constraints imposed by the stereo rig, 
and does not solve the problem of initialising the iterative minimisation. Instead, 
we make two (reasonable) assumptions about our binocular rig: (i) its internal 
geometry is unchanging (including unchanging camera internal parameters, alt- 
hough they may be different for the two cameras); and (ii) controlled rotations 
of the rig are possible - in particular, we can perform zero-pitch screw motions 
(i.e. rotations around an axis with no translation along the axis). Consequently,
we prove a number of useful results related to self-calibration of the rig. In par- 
ticular: 

— We consider the invariants to a zero-pitch screw motion from a geometrical 
point of view, and how they relate to the invariants of a 2D image-to-image 
homography induced by a plane, or rotation about the camera optical centre; 

— Using these invariants we show how to construct the image locations of points 
at infinity irrespective of scene degeneracy. One such point turns out to have 
visual direction orthogonal to the rotation axes and is therefore a natural 
alignment reference; 

— We show how by combining multiple views of the scene plane, we can con- 
struct further points at infinity (and incidentally compute the epipolar geo- 
metry of the head); 

— We use the constructed points at infinity to determine the infinite homo- 
graphy relating points at infinity between cameras, and in certain cases, the 
camera parameters. 

In summary, we obtain a method to determine a closed-form solution for 
the infinite homography, and hence two focal lengths from three stereo views 
of a plane. Additionally (since we can compute the epipolar geometry), we can 
backproject points to obtain the plane at infinity. 

We demonstrate these results with respect to a typical binocular pan-tilt head 
used for autonomous navigation (Fig. 1), an application in which the knowledge
of the relationships between the world and camera images (the calibration) is 
crucial. 

The use of invariants stems from previous work in self-alignment of a stereo 
head, where the head is rotated so that the cameras are parallel and their optical 
axes lie perpendicular to both the pan and elevation axes. This is important 
since it provides a means by which the accurate relative angles provided by joint 
encoders may be upgraded to absolute measurements. The original work |TC1 
cn] looked at invariants to motion in 3D projective space, easily recoverable for 
non-degenerate scenes. We show how, without needing to calibrate the cameras 
or calculate epipolar geometry, 2D invariants can provide the same alignment
information even when degeneracy is present, and we present results to confirm
the accuracy of this procedure. We also note and test the ability of the same
algorithm to align the head if the degeneracy present is that of the motion of
the camera (it rotates about its optical centre) rather than of the scene.

Fig. 1. A typical stereo pan-tilt head, mounted on a mobile robot



1.1 Relationship to Previous Work 



Related previous work on self-calibration of stereo rigs can be found in P23E1 
II 6j . In jl the authors are concerned with obtaining affine calibration, that is 
to say, the location of the plane at infinity, under zero-pitch screw motions, occa- 
sionally also termed ‘planar motions ’■ 1231 extended this to general motion and 
full Euclidean calibration (ISEl gave further theoretical results and algorithms). 
PI used similar ideas from projective geometry but solved the related problem 
of self-alignment, as defined above. 

All of this work is predicated on the ability to compute projective structure 
and/or the (projective) locations of the cameras, by viewing a general scene. 
Here we concentrate on the problem of viewing a planar scene, but restrict 
ourselves, like HI2|Ej, to zero-pitch screws. We show that self-alignment using 
image invariants is simpler and more robust than that achieved by PI, and how 
it can for a typical stereo head be applied to any scene, planar or not. 

The most prominent and general work on planar calibration is clearly that of 
Triggs HU, but others have considered the same problem. By assuming various 
scene constraints, most usually known metric structure or known orthogonality, 
pilllHI22| all develop simple, flexible monocular calibration methods which are 
not reliant on initial guesses at the calibration values. We make no assumptions 
about the scene viewed, other than its planarity, but do require the special 
motions stereo heads provide. However, note that in contrast to IBIldIRjl which
describe methods for self-calibration of active pan-tilt devices, we do not assume 
that the absolute motion is known (e.g. exact angle of rotation about the optic 
centre), only that it is of the form of a zero-pitch screw.

1.2 Paper Organisation

Section 2 of this paper establishes the geometric groundwork for the work and
the associated mathematical theory, proving the results listed above. In Section 3 we
present detailed experiments in self-alignment using both real and simulated
data, and give results which validate the self-calibration theory. Conclusions are
drawn in Section 4.



2 Theory and Implementation 



2.1 Background and Notation 

Each camera is modelled by a linear, central projection model (so non-linear 
effects such as radial distortion are neglected) with its intrinsic parameters en- 
coded by an upper triangular matrix 



\[
\mathtt{K} = \begin{pmatrix} f & \delta & u_0 \\ 0 & \alpha f & v_0 \\ 0 & 0 & 1 \end{pmatrix} \qquad (1)
\]

World points and planes are represented as homogeneous 4-vectors and written
as upper-case bold characters (X, Π), image points and lines are represented
as homogeneous 3-vectors and written as lower-case bold characters (x, l). Matrices
are written in teletype (K, H).

We denote left and right camera images by subscripts $l$ and $r$, and different
stereo pairs (henceforth referred to as views) by integers. So a homography
mapping points from left to right would be denoted by $\mathtt{H}_{lr}$, and from the left
image in view 1 to that in view 2 by 

Any rigid transformation of a scene may be represented as a screw consisting 
of a rotation about an axis and a translation along the axis. The pitch of the screw 
defines the distance of this translation, and we are particularly concerned with 
the case where the pitch is zero. Note that zero-pitch screw motion should not 
be confused with degenerate rotation about the origin involving no translation 
whatsoever. The zero-pitch case permits translation orthogonal to the rotation 
axis. 
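
The distinction between a zero-pitch screw and a pure rotation about the origin can be made concrete with a small numerical sketch. It assumes NumPy, and the helper names `rodrigues` and `zero_pitch_screw` are hypothetical (they do not come from the paper). A rotation by R about an axis through a point p acts as X' = R(X - p) + p, so the translation part is t = (I - R)p, which is generally non-zero but has no component along the axis.

```python
import numpy as np

def rodrigues(axis, angle):
    """Rotation matrix for a rotation of `angle` radians about the direction `axis`."""
    a = np.asarray(axis, dtype=float)
    a = a / np.linalg.norm(a)
    A = np.array([[0.0, -a[2], a[1]],
                  [a[2], 0.0, -a[0]],
                  [-a[1], a[0], 0.0]])  # cross-product matrix [a]_x
    return np.eye(3) + np.sin(angle) * A + (1.0 - np.cos(angle)) * (A @ A)

def zero_pitch_screw(axis, point_on_axis, angle):
    """Return (R, t) such that X' = R X + t is a rotation about an axis through `point_on_axis`."""
    R = rodrigues(axis, angle)
    p = np.asarray(point_on_axis, dtype=float)
    t = (np.eye(3) - R) @ p  # from X' = R (X - p) + p
    return R, t

# Illustrative values only: a 5 degree rotation about a vertical axis offset from the origin.
axis = np.array([0.0, 1.0, 0.0])
R, t = zero_pitch_screw(axis, point_on_axis=[0.2, 0.0, -0.1], angle=np.deg2rad(5.0))
pitch = float(axis @ t)            # translation along the axis (zero for a zero-pitch screw)
slide = float(np.linalg.norm(t))   # translation orthogonal to the axis (non-zero here)
```

Here `pitch` vanishes while `slide` does not, which is exactly the case the text permits.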

Scene structure can be classified as projective if it is known up to an ar- 
bitrary 3D homographic transformation of the ‘true’ Euclidean structure. Affine 
structure, for which the concepts of parallelism and mid-points make sense, can 
be obtained from projective structure using knowledge of the location of the 
plane at infinity. Affine structure may be upgraded to Euclidean using knowledge
of the camera intrinsics K or equivalently the image of the absolute conic
$\omega = \mathtt{K}^{-\top}\mathtt{K}^{-1}$ or its dual $\omega^* = \mathtt{K}\mathtt{K}^{\top}$.



2.2 Motion Invariants 

Figure 2(a) shows a camera rotating about some axis in space with zero pitch,
as is the case when our binocular head rotates about just one of its axes.

Fig. 2. (a), (b) Planes and line at infinity invariant to rotation



It is evident from the figure that during the motion, each member of the 
set, or pencil, of planes perpendicular to the screw axis rotates into itself and 
is therefore preserved under the motion. Since the planes are parallel, the axis 
of the pencil, the line of common intersection of the planes, lies on the plane at 
infinity $\Pi_\infty$, and is invariant to the motion. Likewise the rotation axis itself is
invariant to the motion. 

Figure 2(b) shows the same situation with the camera represented as an
optical centre and image plane. Every point on the invariant plane that passes 
through the optic centre projects to the same line on the image plane, dashed in 
the figure. Thus in the case of zero-pitch screw motion, whatever the relationship 
between images before and after the motion, it will have an invariant line which 
is the projection of the invariant line at infinity. 

Let us now restrict ourselves specifically to the case of a planar scene. The 
images before and after the motion are related by a plane-induced homography 
H. The invariant plane through the optic centre passes through the scene plane 
in an invariant line. There is also an invariant point, the intersection of the 
screw axis with the scene plane. These invariants have the following algebraic 
interpretation: 

The eigenvectors of the plane-induced homography H under zero-pitch 
screw motion consist of (i) a real eigenvector, which is the image of the 
intersection of the screw axis and the scene plane and (ii) a complex 
conjugate pair, which span the image of the invariant line on the plane, 
and so also the image of the invariant line at infinity. 
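
One way to see part of this algebraically is sketched here under an assumed parametrisation that is not taken from the paper: the scene plane is $\mathbf{n}^{\top}\mathbf{X} = d$ in the first camera frame, the screw axis has unit direction $\mathbf{a}$ (so $\mathtt{R}\mathbf{a} = \mathbf{a}$) and passes through the point $\mathbf{p}$, and the axis is not parallel to the plane. The derivation shows that the axis-plane intersection is a fixed point and that every eigenvector with eigenvalue different from 1 lies on the image line $\mathtt{K}^{-\top}\mathbf{a}$; it does not by itself establish that the remaining pair is complex.

```latex
\begin{align*}
\mathtt{H} &\simeq \mathtt{K}\,\mathtt{M}\,\mathtt{K}^{-1}, \qquad
\mathtt{M} = \mathtt{R} + (\mathtt{I}-\mathtt{R})\,\mathbf{p}\,\mathbf{n}^{\top}/d, \\
\mathtt{M}\,\mathbf{X}_0 &= \mathtt{R}(\mathbf{p} + s\,\mathbf{a}) + (\mathtt{I}-\mathtt{R})\,\mathbf{p} = \mathbf{X}_0
\quad\text{for } \mathbf{X}_0 = \mathbf{p} + s\,\mathbf{a},\ \ \mathbf{n}^{\top}\mathbf{X}_0 = d, \\
\mathbf{a}^{\top}\mathtt{M} &= \mathbf{a}^{\top}\mathtt{R}
  + \mathbf{a}^{\top}(\mathtt{I}-\mathtt{R})\,\mathbf{p}\,\mathbf{n}^{\top}/d = \mathbf{a}^{\top}, \\
\mathtt{M}\mathbf{v} = \mu\,\mathbf{v},\ \mu \neq 1
  &\;\Rightarrow\; \mathbf{a}^{\top}\mathbf{v} = \mathbf{a}^{\top}\mathtt{M}\,\mathbf{v}
  = \mu\,\mathbf{a}^{\top}\mathbf{v}
  \;\Rightarrow\; \mathbf{a}^{\top}\mathbf{v} = 0 .
\end{align*}
```

In pixel coordinates the eigenvectors become $\mathtt{K}\mathbf{v}$, and they satisfy $(\mathtt{K}^{-\top}\mathbf{a})^{\top}\mathtt{K}\mathbf{v} = 0$, so the line they span is the image of the line at infinity orthogonal to the screw axis.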

Images are also related by a 2D homography if the screw axis passes through 
the camera optical centre. Our argument for the invariant line is universally 
valid for zero-pitch screw motion, so this homography must also have a complex 
conjugate pair of eigenvectors which span the invariant line. In this case the real 
eigenvector becomes the image of the screw axis (which is imaged in a point).

2.3 Constructing Points at Infinity

Two Motions, One Camera. We have shown that if a camera undergoes 
zero-pitch screw motion, we can calculate an image line invariant to the motion 
that is the projection of an invariant line at infinity orthogonal to the screw 
axis. If the camera now undergoes two such motions about different axes, the 
two resulting invariant lines will intersect in a point orthogonal to both axes. 
We can therefore conclude: 

A zero-pitch screw motion about two non-parallel axes suffices to construct
the image projection of a point at infinity whose visual direction
is orthogonal to both axes. 

A direct consequence is that fixating this point will result in alignment of the 
camera’s optic axis orthogonal to the two rotation axes. This is the problem 
addressed in uni, but the new result holds for a single camera (of course it also 
holds for a binocular rig), for planar scenes, and does not require the computation 
of epipolar geometry and/or scene structure as an intermediate step meaning 
it is potentially more robust. We require only that we know the relationship 
between images before and after the motion (a fundamental matrix if there is 
no degeneracy, or a homography if there is) . 

We can therefore summarise the algorithm for self-alignment of a stereo pan- 
tilt head for the special cases of degenerate (planar) scenes, and degenerate 
motions (the camera rotates about its optical centre), for which the image rela- 
tionship is a homography: 

To self-align a stereo head in the presence of degeneracy: 

1. Obtain three images of the scene, one in the initial ‘wake-up’ position, one following 
a small rotation about the elevation axis, and one following a small rotation about 
the pan axis. The rotation must retain matchable image features between views,
so between 2° and 10° is sensible for typical cameras.

2. Calculate inter-image homographies between the three images. In our experiments 
we use robust matching of corner features between images, and a non-linear mini- 
misation to calculate H that minimises reprojection error. 

3. Eigendecompose each H and obtain the complex conjugate eigenvectors, $v_1$ and
$v_2$. Each motion yields one invariant line, $\lambda_{tilt}$ or $\lambda_{pan}$, computed from the
eigenvectors of the appropriate homography as $\lambda = (v_1 + v_2) \times i(v_1 - v_2)$.

4. Fixate the intersection of the lines, $x = \lambda_{tilt} \times \lambda_{pan}$. This can be done without
calibration information by visually servoing to the point as in u^.
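
Steps 3 and 4 can be sketched numerically as follows. This is an illustrative sketch assuming NumPy, not the authors' implementation: the function names are hypothetical, the homographies are synthesised from an assumed model (plane n.X = d, rotation axis through a point near the camera) rather than estimated from matched corners, and the fixation itself (visual servoing) is not covered. For a conjugate pair, (v1 + v2) x i(v1 - v2) is proportional to Re(v) x Im(v), which is what the code uses.

```python
import numpy as np

def rodrigues(axis, angle):
    """Rotation matrix for a rotation of `angle` radians about the direction `axis`."""
    a = np.asarray(axis, dtype=float)
    a = a / np.linalg.norm(a)
    A = np.array([[0.0, -a[2], a[1]], [a[2], 0.0, -a[0]], [-a[1], a[0], 0.0]])
    return np.eye(3) + np.sin(angle) * A + (1.0 - np.cos(angle)) * (A @ A)

def screw_homography(K, axis, point_on_axis, angle, n, d):
    """Synthetic plane-induced homography for a zero-pitch screw (test data only)."""
    R = rodrigues(axis, angle)
    t = (np.eye(3) - R) @ np.asarray(point_on_axis, dtype=float)
    return K @ (R + np.outer(t, n) / d) @ np.linalg.inv(K)

def invariant_line(H):
    """Image line spanned by the complex-conjugate eigenvectors of H (step 3)."""
    w, V = np.linalg.eig(np.asarray(H, dtype=float))
    k = int(np.argmax(np.abs(w.imag)))          # one member of the complex pair
    if abs(w[k].imag) < 1e-12 * np.abs(w).max():
        raise ValueError("no complex-conjugate eigenvalue pair found")
    v = V[:, k]
    line = np.cross(v.real, v.imag)             # proportional to (v1 + v2) x i(v1 - v2)
    return line / np.linalg.norm(line)

def alignment_point(H_tilt, H_pan):
    """Homogeneous image point where the two invariant lines intersect (step 4)."""
    return np.cross(invariant_line(H_tilt), invariant_line(H_pan))

# Illustrative values only.
K = np.array([[800.0, 0.0, 320.0], [0.0, 800.0, 240.0], [0.0, 0.0, 1.0]])
n = np.array([0.1, -0.2, 1.0])
n = n / np.linalg.norm(n)
d = 10.0
pan_axis = np.array([0.0, 1.0, 0.0])
tilt_axis = np.array([1.0, 0.0, 0.0])
H_pan = screw_homography(K, pan_axis, [0.05, 0.0, -0.08], np.deg2rad(5.0), n, d)
H_tilt = screw_homography(K, tilt_axis, [0.0, 0.04, -0.06], np.deg2rad(5.0), n, d)

x = alignment_point(H_tilt, H_pan)
direction = np.linalg.inv(K) @ x                # visual direction of the fixation point
direction = direction / np.linalg.norm(direction)
```

In this synthetic set-up `direction` comes out parallel to the cross product of the tilt and pan axes, that is, orthogonal to both, as the construction requires.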

Note that for a binocular system, the above algorithm can be performed for 
each of the left and right cameras separately. We verify the algorithm experimentally
in Section 3.1.

One Motion, Two Cameras. Our reasoning to this point is valid for a mo- 
nocular system. Now consider the case of two cameras on a stereo rig viewing 
the scene, and undergoing the same motion (so their relative positions remain
fixed), as in Fig. 3. For each camera there is a different invariant plane through
the optic centre, but both planes meet the plane at infinity in the same line, so the invariant
lines in each image are matched lines at infinity. This is true whether or not the 
scene is planar. 



Fig. 3. Stereo view of a planar scene; panel (b) shows the images (figure label: invariant
line at infinity). Note the left and right invariant planes are
parallel and perpendicular to the screw axis, but perspective has been increased here
so that they intersect within the figure



If the scene is indeed planar, the invariant planes intersect the scene in two 
parallel lines. These lines intersect on the vanishing line of the plane, where 
it meets the plane at infinity, its ‘horizon’. We can image both lines in each 
camera, as shown in Fig. 3(b), by transferring each line to the other image using
the homography $\mathtt{H}_{lr}$. The intersection of these image lines, then, is the image of
the point on the scene plane’s horizon. 
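
The line transfer used in this construction can be sketched as follows, assuming NumPy. The function names and the numerical values are hypothetical and serve only as illustration. If points map as x' ~ H x, then lines map as l' ~ H^{-T} l, and the horizon point in the right image is the cross product of the transferred left invariant line with the right invariant line.

```python
import numpy as np

def transfer_line(H, line):
    """Map an image line through the point homography H (x' ~ H x): l' ~ H^{-T} l."""
    return np.linalg.inv(H).T @ np.asarray(line, dtype=float)

def horizon_point(line_left, line_right, H_lr):
    """Right-image intersection of the transferred left line with the right line."""
    p = np.cross(transfer_line(H_lr, line_left), np.asarray(line_right, dtype=float))
    return p / np.linalg.norm(p)

# Illustrative values only: an arbitrary left-to-right homography and two lines.
H_lr = np.array([[1.1, 0.02, 5.0],
                 [-0.03, 0.95, -3.0],
                 [1e-4, 2e-4, 1.0]])
x1 = np.array([100.0, 50.0, 1.0])
x2 = np.array([200.0, 80.0, 1.0])
line_left = np.cross(x1, x2)                 # left invariant line through x1 and x2
line_right = np.array([0.01, 1.0, -120.0])   # right invariant line
line_from_left = transfer_line(H_lr, line_left)
p = horizon_point(line_left, line_right, H_lr)
```

The same function applied with the inverse homography gives the corresponding point in the left image.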

This brings up an interesting simplification of self-alignment when the scene 
consists of the ground plane. For a typical stereo pan-tilt head the pan axis is 
perpendicular to the ground plane. So the ground plane’s vanishing line is also 
perpendicular to the pan axis. All that is required is to make a single rotation of 
the head about the elevation axis, and construct the point on the scene horizon as 
above. This point is then both perpendicular to the pan axis, and to the elevation 
axis (it lies on the invariant line to the elevation), and so is the fixation point 
we require for alignment. 

Constructing this vanishing point will generally not work for a stereo rig 
undergoing a pan, since, typically, the camera centres of such a rig will lie on 
the same pan-invariant plane, making the left and right invariant lines images 
of the same line on the scene plane.

Two Motions, Two Cameras. Now if we have a stereo rig undergoing two
motions there will be two invariant lines for each camera, and their intersection 
will be a matched point at infinity which is in general not on the scene plane. 

2.4 Combining Views for Planar Scenes 

The relationship between the left and right cameras is fixed during the controlled 
motions, and this enables us to treat the motions as, interchangeably, either 
motions of the cameras, or motions of the scene. As Fig. 4 shows, two views of
one scene plane are equivalent to a single view of two planes that are (in general) 
not coincident. This suggests that we can combine image features from the two 
views as if they were from one view. Note that the scene is no longer degenerate 
so, for instance, the fundamental matrix can now be determined.




Fig. 4. Camera-centred representation of zero-pitch screw motion 



Obtaining Additional Vanishing Points. We observed that a single motion 
of a binocular rig in a planar scene (in most cases excluding a pan) enables us 
to construct the image of a point on the scene’s horizon, its intersection with 
the plane at infinity. For this we need an initial view (1) and an elevated view 
(2). If we also have a panned view (3) we can transfer this vanishing point from 
view 1 to this view using the planar homographies relating views 1 and 3. Combining
points in all three views, we now have one view of three planes, and a point on
the vanishing line of each. Two of them will lie on the elevation-invariant line 
but the third will not. 

2.5 Recovering the Left-Right Infinite Homography, $\mathtt{H}_{\infty lr}$

We have established that from three binocular views of a plane related by an
elevation followed by a pan we can construct four matched points at infinity. One
is the fixation point for alignment (§2.3); the others are points on the vanishing
line of the plane in each view (§2.4). We know that $\mathtt{H}_{\infty lr}$ is the same in every
view; therefore normally four matched points at infinity in the three views would
be enough to calculate $\mathtt{H}_{\infty lr}$. Unfortunately, three of the points lie on the same
line, the line invariant to the elevation.

However the epipole is also a corresponding point for any plane-induced ho- 
mography and so may be used as the fourth point. The epipole can be determined 
from the planar homographies of the three views without computing the funda- 
mental matrix by observing that the epipole in the right image must be invariant 
to the matrix hr. |E|- The epipole is the non-degenerate eigenvector of this 
matrix. 
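
A minimal numerical sketch of this idea, assuming NumPy and assuming that the matrix in question is the composition of the left-to-right homography of one view with the inverse of that of another view (written `H_b @ inv(H_a)` here). The function name and the synthetic rig parameters are hypothetical. For two planes seen by the same camera pair, this matrix has the form s(I + e u^T), so it has one repeated eigenvalue and one distinct eigenvalue whose eigenvector is the right epipole e.

```python
import numpy as np

def epipole_from_homographies(H_a, H_b):
    """Right-image epipole: eigenvector of H_b H_a^{-1} with the non-repeated eigenvalue."""
    G = np.asarray(H_b, dtype=float) @ np.linalg.inv(np.asarray(H_a, dtype=float))
    w, V = np.linalg.eig(G)
    # For each eigenvalue, distance to its nearest neighbour; the repeated pair has ~0 gap.
    gap = [min(abs(w[i] - w[j]) for j in range(3) if j != i) for i in range(3)]
    e = np.real(V[:, int(np.argmax(gap))])
    return e / np.linalg.norm(e)

# Illustrative synthetic rig: X_r = R X_l + t, planes n_i . X_l = d_i in the left frame.
K_l = np.array([[800.0, 0.0, 320.0], [0.0, 800.0, 240.0], [0.0, 0.0, 1.0]])
K_r = np.array([[1000.0, 0.0, 330.0], [0.0, 1000.0, 250.0], [0.0, 0.0, 1.0]])
c, s = np.cos(0.1), np.sin(0.1)
R = np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])   # vergence about the y axis
t = np.array([-1.0, 0.0, 0.05])
n1, d1 = np.array([0.0, 0.0, 1.0]), 5.0
n2, d2 = np.array([np.sin(0.4), 0.0, np.cos(0.4)]), 4.0

H_1 = 1.7 * K_r @ (R + np.outer(t, n1) / d1) @ np.linalg.inv(K_l)   # arbitrary scales
H_2 = -0.6 * K_r @ (R + np.outer(t, n2) / d2) @ np.linalg.inv(K_l)

e = epipole_from_homographies(H_1, H_2)
e_true = K_r @ t
e_true = e_true / np.linalg.norm(e_true)
```

The estimate `e` agrees with `e_true` up to sign. With noisy homographies the repeated eigenvalue splits, and the separation between eigenvalues shrinks as the rotation between views gets smaller, which makes the estimate less stable.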

Further rotations and further views can help provide a more accurate estimate
of $\mathtt{H}_{\infty lr}$: wherever a view is related to another by an elevation, a point on
the plane’s vanishing line in that view can be calculated, and transferred into 
other views using plane-to-plane homographies. We have tested a four view me- 
thod involving one pan and two elevations which results in a total of six distinct 
matched points at infinity. 

2.6 Finding the Camera Internal Parameters 

It is well known that the infinite homography $\mathtt{H}_\infty$, which maps points at infinity
between two images $i$ and $j$, provides a number of constraints on the images of
the absolute conic for those cameras, $\omega_i$ and $\omega_j$, and hence on the internal parameters
caHEI.A number of calibration methods take advantage of these constraints 
directly (cameras related by rotation R): 

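\[
\mathtt{H}_\infty = \mathtt{K}_j \mathtt{R} \mathtt{K}_i^{-1}
\quad\Rightarrow\quad
\omega_j = \mathtt{H}_\infty^{-\top}\, \omega_i\, \mathtt{H}_\infty^{-1}
\]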

If $\mathtt{K}_i = \mathtt{K}_j$ then one constraint on K such as zero skew ($\delta = 0$ in (1)) enables us
to solve for the four remaining unknowns; however, in general one cannot assume
that the cameras will have identical calibration.

In order to use the infinite homography constraint effectively in our case
where we have $\mathtt{H}_{\infty lr}$, we solve for the two focal lengths, constraining all other
parameters to typical values ($\delta = 0$, $(u_0, v_0)$ = the image centre, and $\alpha = 1$ in
(1)). Since these are reasonable assumptions for typical cameras, they do not
impact greatly on the accuracy of the focal length estimates, and can be relaxed 
during a subsequent bundle adjustment phase if required. As they stand, inexact 
camera parameters can still be used in many tasks such as aiding rapid fixation 
of a stereo head. ^ shows how to apply the calibration constraints in a simple, 
linear manner. 
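
One possible linear formulation is sketched below, assuming NumPy. It is not necessarily the cited method, and the function names are hypothetical. After shifting both principal points to the origin, each camera has K = diag(f, f, 1), and the constraint above becomes H^T diag(1, 1, f_r^2) H proportional to diag(1, 1, f_l^2). The three off-diagonal entries and the equality of the first two diagonal entries are linear in f_r^2; f_l^2 then follows from a ratio of diagonal entries.

```python
import numpy as np

def rodrigues(axis, angle):
    """Rotation matrix for a rotation of `angle` radians about the direction `axis`."""
    a = np.asarray(axis, dtype=float)
    a = a / np.linalg.norm(a)
    A = np.array([[0.0, -a[2], a[1]], [a[2], 0.0, -a[0]], [-a[1], a[0], 0.0]])
    return np.eye(3) + np.sin(angle) * A + (1.0 - np.cos(angle)) * (A @ A)

def focal_lengths_from_hinf(H, centre_left, centre_right):
    """Estimate (f_left, f_right) from H ~ K_r R K_l^{-1}, assuming zero skew,
    unit aspect ratio and known principal points."""
    T_l = np.array([[1.0, 0.0, -centre_left[0]], [0.0, 1.0, -centre_left[1]], [0.0, 0.0, 1.0]])
    T_r = np.array([[1.0, 0.0, -centre_right[0]], [0.0, 1.0, -centre_right[1]], [0.0, 0.0, 1.0]])
    Hn = T_r @ np.asarray(H, dtype=float) @ np.linalg.inv(T_l)
    c = [Hn[:, 0], Hn[:, 1], Hn[:, 2]]

    def entry(j, k):
        # (Hn^T diag(1, 1, a_r) Hn)[j, k] = const + a_r * coef
        return c[j][0] * c[k][0] + c[j][1] * c[k][1], c[j][2] * c[k][2]

    p11, q11 = entry(0, 0)
    p22, q22 = entry(1, 1)
    p33, q33 = entry(2, 2)
    # Off-diagonal entries vanish and the first two diagonal entries are equal.
    eqs = [entry(0, 1), entry(0, 2), entry(1, 2), (p11 - p22, q11 - q22)]
    const = np.array([e[0] for e in eqs])
    coef = np.array([e[1] for e in eqs])
    a_r = -(coef @ const) / (coef @ coef)          # least-squares estimate of f_r^2
    a_l = (p33 + a_r * q33) / (p11 + a_r * q11)    # f_l^2 from the (3,3)/(1,1) ratio
    # With noisy data a_l or a_r may be negative, in which case the square root is NaN.
    return float(np.sqrt(a_l)), float(np.sqrt(a_r))

# Illustrative noise-free check with an arbitrary overall scale on H.
K_l = np.array([[800.0, 0.0, 320.0], [0.0, 800.0, 240.0], [0.0, 0.0, 1.0]])
K_r = np.array([[1000.0, 0.0, 330.0], [0.0, 1000.0, 250.0], [0.0, 0.0, 1.0]])
R = rodrigues([0.1, 1.0, 0.05], 0.15)
H_inf = -2.5 * K_r @ R @ np.linalg.inv(K_l)
f_l, f_r = focal_lengths_from_hinf(H_inf, (320.0, 240.0), (330.0, 250.0))
```

The equations are unweighted and have very different magnitudes, so with real data some normalisation would be advisable. The system also degenerates when the relative rotation is purely about the optical axis, since the third row of the centred homography then carries no information about the focal lengths.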

2.7 Obtaining Projective Structure and the Plane at Infinity 

Although in general our aim has been to achieve as much as possible without 
computing the explicit epipolar geometry, there is nothing to stop us combining 
views to this end. After combination we have one stereo view of two (or more) 
planes, and we can therefore compute the epipolar geometry, using for example
Pritchett’s method m (see The fundamental matrix can be computed
from any left-right plane-induced homography and the epipole e, as
$\mathtt{F} = [\mathbf{e}]_\times \mathtt{H}_{lr}$.
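
The construction of F from a homography and the epipole can be checked with a short sketch, assuming NumPy. The function names and the synthetic rig are hypothetical, and the epipole is taken to be the one in the right image. The epipolar constraint x_r^T F x_l = 0 is then verified for a point that does not lie on the plane inducing the homography.

```python
import numpy as np

def skew(e):
    """Cross-product matrix [e]_x, so that skew(e) @ v equals np.cross(e, v)."""
    return np.array([[0.0, -e[2], e[1]],
                     [e[2], 0.0, -e[0]],
                     [-e[1], e[0], 0.0]])

def fundamental_from_homography(e_right, H_lr):
    """F = [e]_x H for a left-to-right plane-induced homography and the right epipole."""
    return skew(np.asarray(e_right, dtype=float)) @ np.asarray(H_lr, dtype=float)

# Illustrative synthetic rig: X_r = R X_l + t, scene plane n . X_l = d.
K_l = np.array([[800.0, 0.0, 320.0], [0.0, 800.0, 240.0], [0.0, 0.0, 1.0]])
K_r = np.array([[1000.0, 0.0, 330.0], [0.0, 1000.0, 250.0], [0.0, 0.0, 1.0]])
c, s = np.cos(0.1), np.sin(0.1)
R = np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])
t = np.array([-1.0, 0.0, 0.05])
n, d = np.array([0.0, 0.0, 1.0]), 5.0

H_lr = K_r @ (R + np.outer(t, n) / d) @ np.linalg.inv(K_l)
F = fundamental_from_homography(K_r @ t, H_lr)

X = np.array([0.5, -0.3, 6.0])        # a scene point off the plane (Z = 6, not 5)
x_l = K_l @ X
x_r = K_r @ (R @ X + t)
residual = abs(x_r @ F @ x_l) / (np.linalg.norm(x_r) * np.linalg.norm(F) * np.linalg.norm(x_l))
```

The normalised `residual` is zero up to rounding error, which confirms that F built this way is valid for general scene points and not only for points on the plane.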

Projective camera matrices can be computed from F and scene structure ge- 
nerated by backprojection. Unsurprisingly, experiments show that the structure 
generated by this method is of reasonable quality as long as the rotations are as 
large as possible. 

Naturally we can also backproject the points and lines at infinity we have 
constructed. From three views we will then have a sufficient number of these to 
calculate the plane at infinity. For example, it is the null space of the three points 
on the vanishing line of the scene plane in each view. Knowledge of the plane at 
infinity enables us to update projective structure to affine, and to Euclidean if 
we include the calculated calibration. 
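
The null-space computation mentioned above can be sketched with a singular value decomposition, assuming NumPy. The function name and the synthetic projective frame are hypothetical. Given three or more back-projected points at infinity as homogeneous 4-vectors, the plane containing them is the right singular vector with the smallest singular value.

```python
import numpy as np

def plane_through_points(points):
    """Homogeneous plane Pi with Pi . X = 0 for every row X of `points` (via SVD)."""
    A = np.asarray(points, dtype=float)
    _, _, Vt = np.linalg.svd(A)       # full SVD: Vt is 4x4 even for a 3x4 input
    return Vt[-1]

# Illustrative check: points at infinity in a Euclidean frame, seen in a projective frame.
T = np.array([[1.0, 0.1, 0.0, 0.2],
              [0.0, 1.0, 0.05, -0.1],
              [0.02, 0.0, 1.0, 0.3],
              [0.01, -0.02, 0.03, 1.0]])          # arbitrary invertible 3D homography
directions = np.array([[1.0, 0.0, 0.2, 0.0],
                       [0.0, 1.0, -0.1, 0.0],
                       [1.0, 1.0, 1.0, 0.0]])      # Euclidean points at infinity (w = 0)
points_proj = directions @ T.T                     # X' = T X for each row

pi_est = plane_through_points(points_proj)
pi_true = np.linalg.inv(T).T @ np.array([0.0, 0.0, 0.0, 1.0])
pi_true = pi_true / np.linalg.norm(pi_true)
```

With more than three points the same call returns a least-squares estimate of the plane.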

Given projective structure it might appear we could proceed with the 3D 
alignment and calibration method of PEcni. Unfortunately, the 3D relations- 
hips between different views can only be calculated if scene structure is non- 
degenerate; for planar data they are underconstrained. Even if it were possible 
to proceed with the 3D method, it would not take full advantage of the constraint 
that the images are related by 2D homographies. 

3 Results 

Lens distortion was not modelled in the simulations and was corrected for in 
real data tests. For our cameras the correction in some cases helped to improve 
the number of point matches obtained, but in other respects the benefit to the 
accuracy of results was small. 



3.1 Alignment 

The alignment algorithm was tested both in simulation and for real data. For 
the simulation, a cube or plane was generated at random orientations at a fixed 
distance from the cameras consisting of 200 points at random positions, and 
the cameras were given random elevation and vergence (within bounds). The 
rotation angles were 3°. The results are shown in the graphs of Fig. 5. The
graphs show the standard deviation of the vergence error $\sigma_\theta$ and that of the
elevation error $\sigma_\phi$ plotted against noise standard deviation $\sigma_n$ in pixels. Each
plot shows results for a general planar scene, the ground plane method of §2.3,
and the motion degeneracy case where the scene is general but we assume the
head axes pass through the camera optical centres (so the degenerate alignment
algorithm still applies). We also examined the effect of varying the number of
feature points available to the algorithm from 200, with $\sigma_n$ fixed at 0.5 pixels.

The results are encouraging. In real images, noise is rarely over 1 pixel stan- 
dard deviation, yet we find the error stays below 1°, which is certainly good 
enough for most tasks. The two-view ground plane method seems to
work almost as well as the general method, with a worse elevation error but a
similar vergence error. This is as expected, since the vergence calculation comes from
the same technique as the general planar method, while the elevation accuracy
relies on the intersection of two image lines which may well be close to parallel
and so poorly conditioned (the more distant the plane, the more parallel the
vertical invariant lines will appear to each camera).

Fig. 5. Alignment simulation results (axis label: number of point matches)



The algorithm is sensitive to an insufficiency of matched point data, but 
accuracy is still better than 1° with fairly typical image noise (0.5 pixels) when 
there are just 50 point matches. 



The algorithm was tested on the real scenes of Fig. 6, and the mean error over
several tests with different initial orientations of the head (some of which were 
quite extreme) was taken. Figures 6(a) and 6(b) show the non-degenerate scenes 
tested, the first having greater depth of structure compared to scene distance, 
meaning a 2D homography would fit the data less well because of the offset 
between the rotation axes and the camera centres. Figures 6(c) and 6(d) show
the scenes tested for planar alignment, the latter being the ground plane, which 
was also tested on the two view ground plane alignment algorithm. The lines in 
the images are the invariant lines.

(a) Real scene with good scene depth
(b) Real scene with little scene depth
(c) General planar scene
(d) Ground plane

Fig. 6. Real scenes for alignment tests



Table 1. Alignment results for real data. $\phi$ refers to elevation, $\theta_l$ and $\theta_r$ to left and
right vergence. Results were obtained by averaging performance over a number of tests.

Figure   Algorithm                              Fixation point   $\phi$ error (°)   $\theta_l$ error (°)   $\theta_r$ error (°)
6(a)     Motion degeneracy (good scene depth)   in image         0.40               0.29                   0.34
                                                off-image        1.58               1.41                   1.26
6(b)     Motion degeneracy (less scene depth)   in image         0.52               1.58                   0.61
                                                off-image        2.40               1.68                   0.80
6(c)     General planar                         in image         0.78               0.89                   0.41
                                                off-image        0.86               2.12                   0.93
6(d)     General planar                         off-image        2.09               2.13                   0.70
         Ground plane                           off-image        4.71               1.00                   0.63

Table 1 shows the alignment results for the real data tests. We have separated
the results into mean error when the fixation point was within the image
and when it was not. This is because fixating a point off-image without prior 
knowledge of camera calibration incurs additional inaccuracy unrelated to the 
alignment algorithm. In the case of the ground plane, the fixation point was 
always off-image, leading to the increased error shown. 

Taking this into account, the results seem to agree with simulation, and 
suggest the algorithm is genuinely practical. Most interesting, perhaps, is that 
for our head, which is fairly typical, the degenerate method can be used with 
good accuracy even where there is considerable structure in the scene because 
of the near degenerate motion of the head (the head axes pass close enough 
to the cameras to assume the cameras rotate about their optical centres). We 
can therefore calculate sufficiently accurate homographic image relationships 
between views for the algorithm to work well. 

3.2 Calibration 

Calibration tests were carried out on simulated planar data generated in the 
same way as in the alignment tests. Figure 7 shows the mean focal length error
$\delta f$ for each camera as a percentage of the true value.




Calibration was tested on the real scene of Fig. 6(c). We were able to obtain
focal length estimates with errors of 16% for the left camera and 15% for the 
right. Generally, the results were biased towards an overestimate, and somewhat 
worse than the simulation suggested. Presumably this has to do with the inac- 
curacy inherent in point matching, a problem which may be solved by the use 
of direct methods to calculate the homographies. Also, due to the limited size 
of our scene plane we were forced to restrict the angle of rotation of the head 
between views, with consequent reduction in the accuracy of the epipole calculation
(see the comments of EZJ). The performance could be refined with some
sensible adjustments to the algorithm (for instance rejecting results unless there
were sufficient point matches) , and of course additional views would increase the 
accuracy. 

As expected, using just three views this linear algorithm is not accurate, but 
it does definitely provide an adequate starting point for bundle adjustment. For 
calibration we used larger rotations (8°-10°). Tests showed both alignment and 
calibration improved the larger the rotation, as long as lack of overlap did not 
cause too few matches between views. This problem could be overcome by using 
intermediate images. 

4 Summary and Conclusions 

We have analysed the invariants and constraints imposed by a stereo pan-tilt 
head rotating about its axes individually, in a planar scene. We have shown 
how to construct point matches at infinity in spite of the scene degeneracy, and 
how to use these to calculate the infinite homography between the left and right 
camera images given three binocular views of the scene plane, two related by 
a pan motion, and two related by an elevation. This homography can then be 
decomposed to give the focal lengths of each camera. Experiments show these 
focal length estimates to be sufficiently good to be used in a variety of contexts. 

One of the points constructed at infinity lies in a direction orthogonal to both 
head axes. We have used this for self-alignment of the cameras, and shown the 
algorithm working well in real scenes. In addition we have noted that for a typical 
stereo head where the rotation axes lie close to the camera optical centres, the 
relationship between camera images during a rotation will be homographic even 
in non-degenerate scenes, and the self- alignment algorithm will therefore work 
equally well. Our experiments have shown that for typical scenes, alignment 
using this degenerate method works as well as the non-degenerate method of 
PCD!, with the advantage of greatly reduced computational expense.

Abstract. Reconstructing the scene from image sequences captured by moving
cameras with varying intrinsic parameters is one of the major achievements of 
computer vision research in recent years. However, there remain gaps in the kno- 
wledge of what is reliably recoverable when the camera motion is constrained to 
move in particular ways. This paper considers the special case of multiple came- 
ras whose optic centres are fixed in space, but which are allowed to rotate and 
zoom freely, an arrangement seen widely in practical applications. The analysis 
is restricted to two such cameras, although the methods are readily extended to 
more than two. 

As a starting point an initial self-calibration of each camera is obtained inde- 
pendently. The first contribution of this paper is to provide an analysis of near- 
ambiguities which commonly arise in the self-calibration of rotating cameras. Se- 
condly we demonstrate how their effects may be mitigated by exploiting the epipo- 
lar geometry. Results on simulated and real data are presented to demonstrate how 
a number of self-calibration methods perform, including a final bundle-adjustment 
of all motion and structure parameters. 



1 Introduction 

A configuration of cameras which occurs commonly in a number of imaging applications 
is that of multiple well-separated cameras whose optic centres are fixed in space, but 
which freely and independently (i) rotate about their optical centres and (ii) zoom in and 
out. This arrangement is used in surveillance and in broadcasting (particularly outside 
broadcasting), and is a pattern for acquiring models for virtual and augmented reality, 
where full or partial panoramas are taken from different positions around a building, for 
example. 

What are the ways of handling the combined imagery from, say, two such uncalibrated 
cameras to recover a Euclidean reconstruction of a static scene? The least committed 
approach might be to generate a projective reconstruction, enforcing the zero-translation 
constraint within the images from a single camera, but then using overall a self-calibration 
algorithm for general motion, such as those of Pollefeys et al., Heyden and Åström, 
or Hartley et al. A practical disadvantage is that general motion methods require 
locating the plane at infinity, but a broader criticism is that the motion is far from general. 

The most committed (and perhaps most obvious) approach is to self-calibrate each 
rotating camera independently, for which methods have been described in the literature. 
The task of reconstruction is then reduced to the more familiar one of 
structure from multiple views using calibrated cameras. Although dealing with each 
camera separately is attractive since it reduces the problem to a set of smaller, less 
complex ones, this method is likely to give poor results if the initial self-calibration is 
inaccurate. Moreover it is clearly not using all the available information. 

The results in this paper provide two pieces of information which inform the solution 
from the spectrum of those available. 



- There are some near-ambiguities in self-calibration of rotating cameras which can 
have a large effect both on the camera intrinsics and a reconstruction obtained from 
them. These effects are present when the self-calibration problem is ill-conditioned, 
in particular with small motions, large focal lengths, short image sequences and a 
poor spread of image features. They can be mitigated by modelling them correctly. 

- Modelling the appropriate degree of inter-camera coupling is desirable. It proves 
useful to exploit the epipolar geometry not only to recover the relative positions 
of the two cameras, but also to refine the self-calibration of both sets of intrinsic 
parameters. 



Two very different measures are used to characterize performance. The rms 
reconstruction error measures the distance between points in two rescaled and aligned 
Euclidean reconstructions. The rms reprojection error measures the faithfulness of 
reconstruction in the image: a low value implies that coplanarity and collinearity are well 
preserved, but it provides little information regarding the preservation of angles. 

Results from any point in the spectrum of solutions may always be used to initialize a 
bundle-adjustment over all scene points and motion parameters, minimizing reprojection 
error. However, in addition to its cost implications for on-line use, bundle-adjustment is 
susceptible to convergence to local minima. The latter is critical in this context where a 
number of near-ambiguities are present since bundle-adjustment tends to make only very 
small changes to the motion parameters. Hence even if the reprojection error is reduced, 
there is by no means a guarantee of a significant change in reconstruction error. One 
goal of this work is to find algorithms which are good enough either to make 
bundle-adjustment unnecessary or to provide better initial estimates to increase the chance of 
convergence to the correct solution within it. 

After briefly introducing the theory of self-calibration of rotating and zooming cameras 
in Section 2, we investigate precisely what information can and cannot be reliably 
extracted from such algorithms in Section 3. In particular we describe two near-ambiguities 
which commonly arise. In Section 4 we review the structure from motion algorithm for 
two calibrated views, which we modify to resolve the ambiguities while minimizing 
epipolar transfer error. Experiments on synthetic and real data are presented in Section 5. 

2 Self-Calibrating Rotating and Zooming Cameras: Review 



The imaging process is modelled by the pinhole camera model so that in the $i$th image, 
the projection $x_i$ of a point $X$ in the scene is described by the relation $x_i = P_i X$, where 
$x_i$ and $X$ are both given in homogeneous coordinates, implying that all vectors, matrices 
and equations are only defined up to an unknown scale factor. $P_i$ is a $3 \times 4$ projection 
matrix which may be decomposed as $P_i = K_i (R_i \;\; t_i)$, where $R_i$ and $t_i$ describe the 
transformation between a coordinate frame attached to the scene and a camera-centred 
coordinate system. $K_i$ is the matrix of intrinsic parameters in image $i$ and has the usual 
form 

\[ K_i = \begin{pmatrix} \alpha_u & s & u_0 \\ 0 & \alpha_v & v_0 \\ 0 & 0 & 1 \end{pmatrix} . \tag{1} \]

$\alpha_u$ and $\alpha_v$ are the focal lengths in the $u$ and $v$ directions, $(u_0, v_0)$ are the coordinates of 
the principal point, and $s$ is a parameter that describes the skew between the two axes 
of the CCD array. 

In the case of a camera rotating about its optic centre, $t = 0$, the final coordinate 
of $X = (X \; Y \; Z \; 1)^T$ is immaterial, and the projection equation simplifies to 
$x_i = K_i R_i (X \; Y \; Z)^T$. Different images taken from the same rotating camera relate to each 
other by homographies which take the form 

\[ x_j = H_{ij} x_i = K_j R_j R_i^T K_i^{-1} x_i = K_j R_{ij} K_i^{-1} x_i . \tag{2} \]
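
Equation (2) can be checked numerically. The following is a minimal sketch, not the authors' implementation: it assumes zero skew and square pixels, and the focal lengths, angles and scene point are illustrative values chosen here. It builds $H_{ij}$ from known $K$ and $R$ and confirms that it transfers a point from image $i$ to image $j$.

```python
import math

def matmul(A, B):
    """Product of two matrices stored as nested lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def intrinsics(alpha, u0, v0):
    """Zero-skew, square-pixel K (a simplifying assumption of this sketch)."""
    return [[alpha, 0.0, u0], [0.0, alpha, v0], [0.0, 0.0, 1.0]]

def inv_intrinsics(K):
    """Closed-form inverse of the zero-skew, square-pixel K above."""
    alpha, u0, v0 = K[0][0], K[0][2], K[1][2]
    return [[1 / alpha, 0.0, -u0 / alpha], [0.0, 1 / alpha, -v0 / alpha], [0.0, 0.0, 1.0]]

def rot_y(a):
    c, s = math.cos(a), math.sin(a)
    return [[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]]

def rot_x(a):
    c, s = math.cos(a), math.sin(a)
    return [[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]]

def project(K, R, X):
    """Pixel position of scene point X for a rotating camera (t = 0)."""
    h = matmul(matmul(K, R), [[c] for c in X])
    return (h[0][0] / h[2][0], h[1][0] / h[2][0])

def homography(K_i, R_i, K_j, R_j):
    """H_ij = K_j R_j R_i^T K_i^-1, equation (2)."""
    return matmul(matmul(K_j, matmul(R_j, transpose(R_i))), inv_intrinsics(K_i))

def transfer(H, x):
    """Apply a homography to the pixel x = (u, v)."""
    h = matmul(H, [[x[0]], [x[1]], [1.0]])
    return (h[0][0] / h[2][0], h[1][0] / h[2][0])

# Illustrative values: the camera zooms from 1000 to 1400 pixels while rotating.
K_i, R_i = intrinsics(1000.0, 192.0, 144.0), rot_y(0.0)
K_j = intrinsics(1400.0, 192.0, 144.0)
R_j = matmul(rot_y(math.radians(3.0)), rot_x(math.radians(-2.0)))
X = [0.3, -0.2, 4.0]

x_i, x_j = project(K_i, R_i, X), project(K_j, R_j, X)
x_j_from_h = transfer(homography(K_i, R_i, K_j, R_j), x_i)
transfer_error = max(abs(a - b) for a, b in zip(x_j, x_j_from_h))
print("transfer error (pixels):", transfer_error)
```

The depth of the scene point never enters the transfer, which is why the homography can be estimated from image correspondences alone.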

The inter-image homographies may be calculated directly from image measurements, 
for instance from point or line correspondences. Various techniques for this calculation 
are available, ranging from fast linear methods minimizing an algebraic error, via 
non-linear methods which minimize the geometric transfer error, to a bundle-adjustment in 
the motion and structure parameters, where the structure comprises points on a mosaic. 
Eliminating $R_{ij}$ from equation (2) yields 

\[ K_j K_j^T = H_{ij} \left( K_i K_i^T \right) H_{ij}^T , \tag{3} \]

which can also be derived by projecting a point on the plane at infinity, $X = (X \; Y \; Z \; 0)^T$, 
into a camera with a non-zero fourth column in $P_i$. The observed inter-image 
homographies $H_{ij}$ are thus the homographies induced by the plane at infinity, and equation (3) is 
known as the infinite homography constraint. $\omega^* = K K^T$ is the dual of the image of 
the absolute conic (DIAC). 

Given the homographies, $H_{ij}$, equation (3) provides constraints on the intrinsic 
parameters. If the camera intrinsics are constant throughout the sequence, the constraint 
reduces to that of Hartley, the DIAC may be computed linearly, and the matrix $K$ 
is found from it by Cholesky decomposition. 
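
The last step, recovering $K$ from the DIAC, can be sketched as follows. This is an illustrative closed-form factorization written for this note, not code from the paper: because $K$ is upper triangular with unit bottom-right element, its entries can be read off $\omega^* = K K^T$ from the bottom-right corner upwards, which plays the role of the Cholesky step. The test matrix and the scale 3.7 are arbitrary.

```python
import math

def diac(K):
    """omega* = K K^T for a 3x3 matrix K."""
    return [[sum(K[i][k] * K[j][k] for k in range(3)) for j in range(3)] for i in range(3)]

def k_from_diac(w):
    """Upper-triangular K with K[2][2] = 1 such that K K^T is proportional to w."""
    w = [[e / w[2][2] for e in row] for row in w]    # remove the unknown scale
    u0, v0 = w[0][2], w[1][2]
    alpha_v = math.sqrt(w[1][1] - v0 * v0)
    s = (w[0][1] - u0 * v0) / alpha_v
    alpha_u = math.sqrt(w[0][0] - s * s - u0 * u0)
    return [[alpha_u, s, u0], [0.0, alpha_v, v0], [0.0, 0.0, 1.0]]

# Illustrative intrinsics with a small skew; the DIAC is known only up to scale.
K_true = [[1200.0, 2.0, 190.0], [0.0, 1150.0, 140.0], [0.0, 0.0, 1.0]]
w_scaled = [[3.7 * e for e in row] for row in diac(K_true)]

K_rec = k_from_diac(w_scaled)
max_err = max(abs(a - b) for ra, rb in zip(K_true, K_rec) for a, b in zip(ra, rb))
print("largest difference between true and recovered K:", max_err)
```

With noisy data the arguments of the square roots can become negative, in which case the estimated DIAC is not positive definite and no real $K$ exists; the sketch does not guard against this.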

For varying intrinsics, de Agapito et al. [2] solve equation (3) in a manner similar 
to that of Pollefeys et al. for cameras undergoing general motion. In a non-linear 
optimization the cost function 

\[ \mathcal{D} = \sum_{i=1}^{n} \left\| K_i K_i^T - H_{0i} K_0 K_0^T H_{0i}^T \right\|_F^2 \tag{4} \]

is minimized, where the elements of $K_i$, $i = 0 \ldots n$, are the unknown parameters. To 
eliminate the unknown scale factors, $K_i K_i^T$ and $H_{0i} K_0 K_0^T H_{0i}^T$ are normalized so that their 
Frobenius norms are equal to one. An advantage of this approach is that any constraints 
on the intrinsic parameters, such as zero skew or known aspect ratio, may be applied 
directly. Alternatively, parameters such as the aspect ratio or principal point can be 
solved for, but constrained to be constant throughout the sequence. A similar approach was 
adopted by Seo and Hong [12], but under known skew and principal point they note that 
the focal lengths can be computed linearly from equation (3). 
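
A sketch of how the cost of equation (4) might be evaluated is given below. It is an illustration under assumptions made here (zero skew, square pixels, a fixed principal point at (192, 144), and synthetic homographies with an arbitrary scale of 2.5), not the authors' code, and it only evaluates the cost rather than minimizing it.

```python
import math

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def unit_frobenius(A):
    """Scale A so that its Frobenius norm is one."""
    n = math.sqrt(sum(e * e for row in A for e in row))
    return [[e / n for e in row] for row in A]

def calibration_cost(Ks, H0):
    """Equation (4). Ks[i] is K_i; H0[i] maps frame 0 to frame i (H0[0] is unused)."""
    w0 = matmul(Ks[0], transpose(Ks[0]))
    cost = 0.0
    for K_i, H in zip(Ks[1:], H0[1:]):
        lhs = unit_frobenius(matmul(K_i, transpose(K_i)))
        rhs = unit_frobenius(matmul(matmul(H, w0), transpose(H)))
        cost += sum((a - b) ** 2 for ra, rb in zip(lhs, rhs) for a, b in zip(ra, rb))
    return cost

def K_of(alpha):
    return [[alpha, 0.0, 192.0], [0.0, alpha, 144.0], [0.0, 0.0, 1.0]]

def K_inv_of(alpha):
    return [[1 / alpha, 0.0, -192.0 / alpha], [0.0, 1 / alpha, -144.0 / alpha], [0.0, 0.0, 1.0]]

def rot_y(a):
    c, s = math.cos(a), math.sin(a)
    return [[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]]

# Synthetic sequence: the focal length grows while the camera pans one degree per frame.
alphas = [1000.0, 1100.0, 1200.0]
Ks = [K_of(a) for a in alphas]
H0 = []
for i, K_i in enumerate(Ks):
    H = matmul(matmul(K_i, rot_y(math.radians(i))), K_inv_of(alphas[0]))
    H0.append([[2.5 * e for e in row] for row in H])   # homographies carry an arbitrary scale

cost_true = calibration_cost(Ks, H0)
cost_wrong = calibration_cost([Ks[0], K_of(1320.0), Ks[2]], H0)   # 20% focal length error in frame 1
print("cost at the true intrinsics:", cost_true)
print("cost with a wrong focal length:", cost_wrong)
```

In this toy example the cost rises only slightly for a 20% focal length error, because the normalized matrices are dominated by the squared focal length; this gives a feel for how shallow the minimum can be.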

In a later work de Agapito et al. proposed a fast linear method for calculating all 
intrinsic parameters by employing an algebraic trick, used previously in another context 
by Armstrong et al. They dealt not with the DIAC, but with its inverse, the image of 
the absolute conic (IAC), $\omega$. Under the assumption of zero skew the IAC is given by 

\[ \omega = \begin{pmatrix} 1/\alpha_u^2 & 0 & -u_0/\alpha_u^2 \\ 0 & 1/\alpha_v^2 & -v_0/\alpha_v^2 \\ -u_0/\alpha_u^2 & -v_0/\alpha_v^2 & 1 + u_0^2/\alpha_u^2 + v_0^2/\alpha_v^2 \end{pmatrix} . \tag{5} \]

Inverting the infinite homography constraint, $\omega_j = H_{ij}^{-T} \omega_i H_{ij}^{-1}$, provides linear 
constraints on the IAC in frame $i$ by setting the (1,2) element of $\omega_j$ to zero. Further constraints 
are available from additional assumptions on the intrinsic parameters, in particular, a 
known aspect ratio and/or a known principal point. 

Most recently, optimal results have been obtained by de Agapito et al. by 
performing a final bundle-adjustment in the motion and structure parameters. 

2.1 Recovering Rotation Matrices and Euclidean Projection Matrices 

For reconstruction, Euclidean projection matrices of the form $(P_i \;\; 0) = (K_i R_i \;\; 0)$ 
are required. The $3 \times 3$ left sub-matrices, $P_i$, are recovered from the projective 
homographies as $H_{0i} K_0$. Rotation matrices, referred to the initial frame, may be found by QR 
decomposition of $P_i$. 

A more direct approach for finding rotations would be to use the recovered $K_i$ matrices 
directly in the equation $R_i = K_i^{-1} H_{0i} K_0$. However, it can be unwise to apply this equation 
in combination with the non-linear self-calibration method of de Agapito et al., especially when the 
principal point is constrained to be constant throughout the sequence in the minimization. 
The reason is that with fewer parameters in the model, the $R_i$ recovered from $K_i^{-1} H_{0i} K_0$ are 
less close to orthonormal. Even fitting an orthonormal matrix to $R_i$ by setting the singular 
values of its SVD to unity does not guarantee that this rotation matrix is the correct 
one, especially since it is an algebraic error (a Frobenius norm) that is minimized when 
projecting $K_i^{-1} H_{0i} K_0$ onto the 3-dimensional space of orthonormal matrices. This method 
could therefore give poor motion recovery and have dire consequences for Euclidean 
reconstruction. Rays would be back-projected incorrectly, and a large reprojection error 
would ensue. In this work we therefore adopt the approach based on QR decomposition when 
using the non-linear self-calibration algorithm. 
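
The decomposition of $P_i = H_{0i} K_0$ into an upper-triangular $K_i$ and a rotation $R_i$ can be sketched with Gram-Schmidt orthogonalization of the rows, starting from the last one. This is an illustrative stand-in written for this note (strictly an RQ rather than a QR factorization, since the triangular factor comes first); the paper does not say which routine was used, and the test values are arbitrary.

```python
import math

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    return math.sqrt(dot(a, a))

def det3(M):
    (a, b, c), (d, e, f), (g, h, i) = M
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)

def rq_decompose(P):
    """Split P ~ K R into upper-triangular K (K[2][2] = 1) and a rotation R."""
    if det3(P) < 0:                              # fix the arbitrary sign of a homogeneous matrix
        P = [[-e for e in row] for row in P]
    scale = norm(P[2])                           # last row of K R is r3, so this is the scale
    P = [[e / scale for e in row] for row in P]
    r3 = P[2]
    v0 = dot(P[1], r3)
    q2 = [p - v0 * c for p, c in zip(P[1], r3)]
    alpha_v = norm(q2)
    r2 = [e / alpha_v for e in q2]
    u0, s = dot(P[0], r3), dot(P[0], r2)
    q1 = [p - u0 * c - s * b for p, b, c in zip(P[0], r2, r3)]
    alpha_u = norm(q1)
    r1 = [e / alpha_u for e in q1]
    return [[alpha_u, s, u0], [0.0, alpha_v, v0], [0.0, 0.0, 1.0]], [r1, r2, r3]

def rot_y(a):
    c, s = math.cos(a), math.sin(a)
    return [[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]]

def rot_x(a):
    c, s = math.cos(a), math.sin(a)
    return [[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]]

# Illustrative test: known K and R, combined with an arbitrary negative scale.
K_true = [[1300.0, 0.0, 200.0], [0.0, 1300.0, 150.0], [0.0, 0.0, 1.0]]
R_true = matmul(rot_y(math.radians(3.0)), rot_x(math.radians(-2.0)))
P = [[-1.7 * e for e in row] for row in matmul(K_true, R_true)]

K_rec, R_rec = rq_decompose(P)
k_err = max(abs(a - b) for ra, rb in zip(K_true, K_rec) for a, b in zip(ra, rb))
r_err = max(abs(a - b) for ra, rb in zip(R_true, R_rec) for a, b in zip(ra, rb))
print("K error:", k_err, " R error:", r_err)
```

The rotation returned this way is exactly orthonormal by construction, which is the property the text relies on when preferring the decomposition over $K_i^{-1} H_{0i} K_0$.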

Since the linear self-calibration method is not parameterized directly in terms 
of camera intrinsics, it does not suffer from the problems of non-orthonormal matrices, 
and the two approaches for recovering $R_i$ are equivalent. 

With pan-tilt cameras rotations are described by two rather than three parameters. A 
practical treatment of the decomposition of rotation matrices into these two parameters 
is provided in [3]. 

3 Ambiguities in Self-Calibration 

Self-calibration is an ill-conditioned problem. Significant advances have been made since 
the work of Maybank and Faugeras, but there are a few underlying ambiguities which 
can have a large effect on results in configurations which poorly constrain the solution: 
coupled changes in the parameters in the model are barely observable. We consider two 
ambiguities present in the case of rotating cameras. It would be more correct to call 
these near-ambiguities: as opposed to true ambiguities which arise from certain motions 
and scenes, ours are only apparent because perspective effects in the cameras 
are less prominent under some camera configurations. A discussion of their relevance to 
reconstruction, motivated by experimental results, is provided later in the paper. 

3.1 The Ambiguity between the Angle of Rotation and the Focal Length 

For small rotations there is an ambiguity between the rotation and the focal length, and 
it is difficult to distinguish between small rotations with a large focal length and larger 
rotations with a small focal length. The ambiguity is easily seen by differentiating the 
calibrated non-homogeneous projection equation $x = (\alpha/Z) X$. Remembering firstly that there 
is no translation, and secondly that the focal length $\alpha$ is a function of time, differentiation 
yields the following image motion in the $x$-direction 

\[ \dot{x} = \alpha \Omega_Y - y \Omega_Z + \frac{x}{\alpha} \left( x \Omega_Y - y \Omega_X \right) + \frac{\dot{\alpha} x}{\alpha} , \tag{6} \]

where $x$ and $X$ are expressed in camera-centred frames and $\Omega$ is the angular velocity. 

Cyclorotation and the relative change in focal length can be recovered from the terms 
$-y \Omega_Z$ and $\dot{\alpha} x / \alpha$ respectively (the latter is zoom-induced looming motion). However, 
the first term $\alpha \Omega_Y$, a uniform motion in the image due to the component of rotation 
perpendicular to the optic axis, contains an ambiguity between focal length and rotation. 
The third term $(x/\alpha)(x \Omega_Y - y \Omega_X)$, which also arises from the component of rotation 
perpendicular to the optic axis, provides some disambiguating information, but the term 
is likely to be small except at the edges of the image. Unfortunately this is also where the 
optical properties of the lens are poorest. Notice too that the disambiguating information 
is weakest for large focal lengths. Compounding these difficulties is that in practical 
applications, sequences taken at large $\alpha$ are less likely to contain significant rotation. 
Since motion is being integrated over time, this ambiguity persists over a sequence of 
images. 

In experiments we find that the ambiguity is much more pronounced when the 
principal point is allowed to vary in the self-calibration algorithm: with more parameters, the 
model is more likely to fit to the noise rather than the underlying true solution. However, 
if the sequence is ill-conditioned, the ambiguity is also noticeable even if the principal 
point is constrained to a constant location. 
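
The size of the disambiguating term in equation (6) can be illustrated with a small numerical sketch. The setup is an assumption made here for illustration: a pure pan about the $y$ axis, points on the image row through the principal point, and two hypothetical configurations in which the focal length is doubled while the rotation is halved.

```python
import math

def image_shift(alpha, theta, x):
    """Horizontal shift (pixels) of the image point at x, measured from the
    principal point on the row y = 0, when the camera pans by theta radians."""
    return alpha * math.tan(math.atan(x / alpha) + theta) - x

wide = (1000.0, math.radians(2.0))   # shorter focal length, larger rotation
tele = (2000.0, math.radians(1.0))   # doubled focal length, halved rotation

diff_centre = abs(image_shift(*wide, 0.0) - image_shift(*tele, 0.0))
diff_edge = abs(image_shift(*wide, 150.0) - image_shift(*tele, 150.0))
print(f"difference at the image centre:   {diff_centre:.3f} pixels")
print(f"difference 150 pixels off-centre: {diff_edge:.3f} pixels")
```

For these values the two configurations differ by roughly a hundredth of a pixel at the image centre and by less than one pixel 150 pixels off-centre, so modest image noise is enough to hide the distinction.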

3.2 The Principal Point/Rotation Ambiguity 

A similar analysis (again using the $x$-dimension of the image motion) shows that it is 
difficult to distinguish between a shift in the principal point along $x$ and a rotation of the 
camera about $y$. If $\delta u$ is the error in the estimation of the principal point, and $\alpha$ is the 
focal length, the erroneous rotation is $\approx \delta u / \alpha$ about $y$. This is an ambiguity between 
parameters from a single image. 

Another way of describing this ambiguity is that a large focal length perspective 
projection is hard to distinguish from a spherical projection where the principal point is 
meaningless. 
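
A short numerical sketch of this ambiguity follows. It assumes, for illustration only, a focal length of 1500 pixels, a principal point error of 10 pixels and an image roughly 380 pixels wide; it compares a uniform shift of $\delta u$ with the image motion produced by a pan of $\delta u / \alpha$.

```python
import math

def image_shift(alpha, theta, x):
    """Horizontal shift (pixels) of the image point at x (row y = 0) for a pan of theta radians."""
    return alpha * math.tan(math.atan(x / alpha) + theta) - x

alpha, delta_u = 1500.0, 10.0
theta = delta_u / alpha                       # rotation that mimics the principal point error

# Residual between the rotation-induced motion and a uniform shift of delta_u pixels
residuals = [image_shift(alpha, theta, x) - delta_u for x in range(-190, 191, 10)]
worst = max(abs(r) for r in residuals)
print(f"largest residual across the image: {worst:.3f} pixels")
```

The residual grows roughly as $\delta u \, x^2 / \alpha^2$, so it shrinks further as the focal length increases, in line with the remark about spherical projection.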

3.3 Experiments 

Figures 1 and 2 illustrate these ambiguities using both simulated and real image data. 
The “bookshelf” sequence was gathered by zooming while moving the vergence 
and elevation axes of one camera of a stereo head (equivalent to pan and tilt, up to an 
ordering of the kinematic chain) so that the optic axis traced a right-circular cone. 
Point features were detected and matched, and homographies derived. Figure 1 shows 
the resulting mosaic. Simulated point data were synthesized similarly. 
Levenberg-Marquardt was used to minimize $\mathcal{D}$ in equation (4), allowing the principal point and focal 
length to vary over the sequence during minimization. (That is, in the minimization there 
are different values to be found for each frame, rather than a single value to be found for 
the whole sequence.) 

In each set of results the first two plots show the recovered and veridical focal length 
and principal point. The + symbols in the third plots show the recovered camera motion 
in terms of elevation and vergence angles. These are roughly circular, but there is a good 
deal of scatter about the best-fit circle. 

In earlier work it was supposed that this scatter arose from noise. However it 
turns out to be almost entirely due to the principal-point/rotation ambiguity. Using the 
ground truth value for the position of the principal point, the elevation and vergence 
angles are corrected and re-plotted as x symbols. These form near perfect circles. 

However, the scale of the motion is still incorrect. This is due to the ambiguity 
between focal length and motion. Table 1 illustrates this point with the recovered scale of 
focal length and motion compared to the ground truth: multiplied together they give a 
number very close to unity. 

Fig. 1. A mosaic constructed from the bookshelf sequence during which the camera panned and 
tilted while the lens zoomed. 

Fig. 2. Correcting the elevation and vergence angles by accepting the principal-point/rotation 
ambiguity. Part (a) uses simulated data, (b) real data. Both sequences use a linearly increasing 
focal length and motion with cone half angle 3°. 

Table 1. Verification of the ambiguity between focal length and motion exhibited in Figure 2. 

4 Improving Self-Calibration via the Epipolar Geometry 

We now turn to the second theme of the paper: using the appropriate degree of coupling 
between the two rotating cameras to improve the self-calibration of each, and thereby also 
the reconstruction. The methods utilize epipolar geometry and it is convenient 
first to review the work of, among others, Longuet-Higgins and Zhang. 



4.1 Stereo from Calibrated Cameras: Review 

The geometry of two calibrated views is encapsulated in the essential matrix, $E$. 
Corresponding image points $x$ and $x'$ in the first and second views are related by 
$x'^T E x = 0$ where $E = [t]_\times R$, and $[t]_\times$ is the skew-symmetric form of the translation vector, $t$, 
describing the location of the optic centre of the second camera in the coordinate frame 
of the first. $R$ is the relative rotation of the two cameras. $E$ has five degrees of freedom, 
three for the rotation and two for translation up to scale. Since $[t]_\times$ is rank two, so is $E$, 
and the left nullspace of $E$ is spanned by $t$. $R$ may be recovered from $E$ and $t$ using quaternions. 

The solution is refined with an algorithm due to Zhang which uses the 
uncalibrated image measurements directly. For uncalibrated views the fundamental matrix, $F$, 
plays a similar role to $E$, and the two matrices relate as $F = K'^{-T} E K^{-1}$ where $K$ and $K'$ 
are the intrinsic parameters for the first and second camera respectively. $F$ is calculated 
directly from image measurements. An initial estimate of $F$ is provided by the linear 
8-point algorithm. The fundamental matrix is refined by minimizing a cost function with 
geometric significance, the distance between points and epipolar lines, 

\[ \mathcal{E} = \sum_k d^2(x'^k, F x^k) + d^2(x^k, F^T x'^k), \qquad d(x^k, F^T x'^k) = \frac{x^{kT} F^T x'^k}{\sqrt{(F^T x'^k)_1^2 + (F^T x'^k)_2^2}} , \tag{7} \]

where the superscripts denote a particular point correspondence and $(F^T x'^k)_j$ is the $j$th 
component of the vector $F^T x'^k$. Thus, given $F$ and the calibration matrices, $E$ may be 
recovered and decomposed. The five parameters in $R$ and $t$ are then refined using the 
same geometric measure as above. $t$ is parameterized by a point on the unit sphere, and 
$R$ by a rotation vector*. 
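
The cost of equation (7) can be sketched as below. The example is illustrative rather than the authors' code: it assumes the convention $X' = R X + t$ for scene points in the second camera frame, zero-skew intrinsics with a principal point at (192, 144), and four made-up scene points. It builds $F$ from known geometry and evaluates the symmetric epipolar distance for exact and for perturbed correspondences.

```python
import math

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def times(M, v):
    """Matrix-vector product."""
    return [sum(m * c for m, c in zip(row, v)) for row in M]

def skew(t):
    return [[0.0, -t[2], t[1]], [t[2], 0.0, -t[0]], [-t[1], t[0], 0.0]]

def K_of(alpha):
    return [[alpha, 0.0, 192.0], [0.0, alpha, 144.0], [0.0, 0.0, 1.0]]

def K_inv_of(alpha):
    return [[1 / alpha, 0.0, -192.0 / alpha], [0.0, 1 / alpha, -144.0 / alpha], [0.0, 0.0, 1.0]]

def pixel(K, X):
    h = times(K, X)
    return (h[0] / h[2], h[1] / h[2])

def epipolar_distance(x, line):
    """Signed distance of the pixel x = (u, v) from a homogeneous line."""
    return (line[0] * x[0] + line[1] * x[1] + line[2]) / math.hypot(line[0], line[1])

def epipolar_cost(F, pairs):
    """Equation (7): pairs holds (x, x') correspondences in pixels."""
    Ft = transpose(F)
    cost = 0.0
    for x, xp in pairs:
        cost += epipolar_distance(xp, times(F, (x[0], x[1], 1.0))) ** 2
        cost += epipolar_distance(x, times(Ft, (xp[0], xp[1], 1.0))) ** 2
    return cost

# Illustrative two-camera geometry
c, s = math.cos(math.radians(20.0)), math.sin(math.radians(20.0))
R = [[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]]
t = [-1.0, 0.0, 0.2]
F = matmul(matmul(transpose(K_inv_of(1250.0)), matmul(skew(t), R)), K_inv_of(1000.0))

scene = [[0.3, -0.2, 4.0], [-0.5, 0.1, 5.0], [0.1, 0.4, 6.0], [0.6, 0.3, 4.5]]
pairs = []
for X in scene:
    Xp = [a + b for a, b in zip(times(R, X), t)]        # X' = R X + t
    pairs.append((pixel(K_of(1000.0), X), pixel(K_of(1250.0), Xp)))

cost_exact = epipolar_cost(F, pairs)
cost_noisy = epipolar_cost(F, [((x[0] + 1.0, x[1] - 1.0), xp) for x, xp in pairs])
print("cost for exact correspondences:    ", cost_exact)
print("cost after a one-pixel perturbation:", cost_noisy)
```

Because the distances are measured in pixels, the cost has a direct geometric interpretation, unlike the algebraic residual $x'^T F x$.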

Having computed $R$, $t$ and the self-calibration of each camera the scene may now 
readily be reconstructed, using not just the images from one stereo pair, but also further 
images. The projection matrices from the first and second cameras, $P$ and $P'$, in images 
$i$ and $i'$ respectively take the form 

\[ P_i = K_i R_i \, (I \;\; 0) \quad \text{and} \quad P'_{i'} = K'_{i'} R'_{i'} \, (R \;\; t) . \tag{8} \]

3D points are found by the intersection of rays back-projected using these camera 
matrices. We will evaluate this algorithm in the experimental section. 

* Zhang also performs a final bundle-adjustment over these five parameters and the 3D structure. 

4.2 Constraints from the Epipolar Geometry of Two Rotating Cameras 



The method given above uses only a single stereo pair to compute $R$ and $t$ and is clearly 
discarding a lot of information. Although we now have sufficient information to obtain 
a Euclidean reconstruction from the entire sequence, the result will be heavily biased 
towards the first pair. 

Besides, since the fundamental matrix has seven degrees of freedom, and the essential 
matrix only five, it is possible to solve for two further parameters in $K$ and $K'$ just from 
the single pair. This is indeed done by Hartley and by Pollefeys et al. [11], who 
use linear methods to solve for the focal lengths assuming the principal points, aspect 
ratios and skew are known. However, in our case the special geometry may be used to 
greater effect by relating additional frames in either sequence to the original frame via 
the inter-image homographies. 

We now write the epipolar constraint between correspondence $k$ in image $i$ from the 
first camera and image $i'$ from the second as 

\[ {x'}_{i'}^{k\,T} F_{ii'} \, x_i^k = 0 . \tag{9} \]

As before, quantities without a dash refer to camera 1 and those with a dash to camera 
2, subscripts relate to the frame number and superscripts to point correspondences. 
Choosing a reference frame from either camera gives 

\[ F_{00} = {K'_0}^{-T} [t]_\times R K_0^{-1} . \tag{10} \]

The fundamental matrix between two further images $i$ and $i'$ from each rotating camera 
relates to $F_{00}$ as 

\[ F_{ii'} = {H'_{i'}}^{-T} F_{00} H_i^{-1} . \tag{11} \]

Parameterizing $F_{ii'}$ in terms of $F_{00}$, points from several image pairs are used to refine 
our estimate of $F_{00}$, and thus also $R$ and $t$. This is the second reconstruction algorithm 
we will investigate. The cost function minimized is the sum of epipolar distances over 
all measured points and all image pairs, 

\[ \mathcal{F} = \sum_{i,i'} \sum_k d^2\!\left({x'}_{i'}^k, F_{ii'} x_i^k\right) + d^2\!\left(x_i^k, F_{ii'}^T {x'}_{i'}^k\right) . \tag{12} \]

Any combinations of $i$ and $i'$ may be chosen, provided image correspondences are 
available. Since $\mathcal{F}$ is a cost function with geometric significance there is a strong correlation 
with the reprojection error, but it is not the optimal error. 



4.3 Improving Self-Calibration and Reconstruction 

Now, since $H_i = K_i R_i K_0^{-1}$ we have that 

\[ F_{ii'} = {K'_{i'}}^{-T} R'_{i'} [t]_\times R R_i^T K_i^{-1} . \tag{13} \]

Thus, estimates of (i) the relative camera positions $R$ and $t$ in a reference frame, (ii) 
the intrinsic parameters in both cameras at each frame in the sequence, and (iii) the 
rotations between frames within sequences from either camera, yield an estimate of the 
fundamental matrix $F_{ii'}$ between further frames of cameras 1 and 2. The goodness of 
this modelled fundamental matrix may then be measured with the cost function $\mathcal{F}$ in 
equation (12). Not only may further image pairs be used to provide further constraints 
on $R$ and $t$, further parameters may be solved for. Effectively we are constraining the 
inter-image homographies together with fundamental matrices. This insight provides 
the basis of the methods we derive for improving the self-calibration, and thus also the 
reconstruction. 
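
The agreement between the transferred fundamental matrix of equation (11) and the modelled one of equation (13) can be confirmed numerically. The sketch below uses illustrative values throughout (focal lengths, rotations, translation, and a fixed principal point at (192, 144)); it is a consistency check written for this note, not part of the authors' method.

```python
import math

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def chain(*Ms):
    """Product of several matrices, left to right."""
    out = Ms[0]
    for M in Ms[1:]:
        out = matmul(out, M)
    return out

def transpose(A):
    return [list(col) for col in zip(*A)]

def inv3(M):
    """Inverse of a 3x3 matrix via the adjugate."""
    (a, b, c), (d, e, f), (g, h, i) = M
    det = a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)
    adj = [[e * i - f * h, c * h - b * i, b * f - c * e],
           [f * g - d * i, a * i - c * g, c * d - a * f],
           [d * h - e * g, b * g - a * h, a * e - b * d]]
    return [[x / det for x in row] for row in adj]

def skew(t):
    return [[0.0, -t[2], t[1]], [t[2], 0.0, -t[0]], [-t[1], t[0], 0.0]]

def rot_y(a):
    c, s = math.cos(a), math.sin(a)
    return [[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]]

def rot_x(a):
    c, s = math.cos(a), math.sin(a)
    return [[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]]

def K_of(alpha):
    return [[alpha, 0.0, 192.0], [0.0, alpha, 144.0], [0.0, 0.0, 1.0]]

# Reference frames of the two cameras
K0, Kp0 = K_of(1000.0), K_of(1250.0)
R, t = rot_y(math.radians(20.0)), [-1.0, 0.0, 0.2]
F00 = chain(transpose(inv3(Kp0)), skew(t), R, inv3(K0))                    # equation (10)

# A later frame i of camera 1 and i' of camera 2, each zoomed and rotated
Ki, Ri = K_of(1400.0), matmul(rot_y(math.radians(3.0)), rot_x(math.radians(-2.0)))
Kpi, Rpi = K_of(1600.0), matmul(rot_y(math.radians(-2.0)), rot_x(math.radians(1.0)))
Hi = chain(Ki, Ri, inv3(K0))                                               # inter-image homographies
Hpi = chain(Kpi, Rpi, inv3(Kp0))

F_transfer = chain(transpose(inv3(Hpi)), F00, inv3(Hi))                    # equation (11)
F_model = chain(transpose(inv3(Kpi)), Rpi, skew(t), R, transpose(Ri), inv3(Ki))  # equation (13)

scale = max(abs(e) for row in F_model for e in row)
rel_diff = max(abs(a - b) for ra, rb in zip(F_transfer, F_model) for a, b in zip(ra, rb)) / scale
print("relative difference between (11) and (13):", rel_diff)
```

In practice the homographies are measured only up to scale, so the two matrices would agree up to a scalar factor rather than element by element.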

We now introduce two methods of self-calibration refinement, depending on which 
ambiguities of Section 3 we wish to resolve. To parameterize the unknowns we write 
the true matrix of intrinsic parameters as 

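\[ K_i = \begin{pmatrix} \hat{\alpha}_i / \beta & 0 & (u_0)_i \\ 0 & \hat{\alpha}_i / \beta & (v_0)_i \\ 0 & 0 & 1 \end{pmatrix} \tag{14} \]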

where $\hat{\alpha}_i$ is the measured focal length recovered from independent self-calibration of the 
rotating cameras, and where $\beta$ is the unknown overall scale factor of the focal lengths 
of this camera over the entire sequence. Skew is assumed to be zero and the aspect ratio 
to be either known from the outset or recovered during self-calibration of each rotating 
camera. We also assume in both methods that the rotation matrices within a single camera 
have only two degrees of freedom, taking the form 

\[ R = R_y(\theta) R_x(\phi) . \tag{15} \]

This is justifiable since pan-tilt cameras are restricted to this kind of motion (the ordering 
of $R_y$ and $R_x$ depends on the particular kinematic chain). 
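
The two-parameter form of equation (15) can be sketched as follows, together with the recovery of the two angles from a given rotation matrix. The axis conventions (rotation about $y$ applied after rotation about $x$, right-handed angles) are assumptions of this sketch; a particular head may use the opposite ordering.

```python
import math

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def rot_y(a):
    c, s = math.cos(a), math.sin(a)
    return [[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]]

def rot_x(a):
    c, s = math.cos(a), math.sin(a)
    return [[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]]

def pan_tilt(theta, phi):
    """R = R_y(theta) R_x(phi), equation (15)."""
    return matmul(rot_y(theta), rot_x(phi))

def pan_tilt_angles(R):
    """Recover (theta, phi) from a matrix of the form R_y(theta) R_x(phi)."""
    theta = math.atan2(-R[2][0], R[0][0])    # first column is (cos theta, 0, -sin theta)
    phi = math.atan2(-R[1][2], R[1][1])      # second row is (0, cos phi, -sin phi)
    return theta, phi

theta_true, phi_true = math.radians(4.0), math.radians(-3.0)
theta_rec, phi_rec = pan_tilt_angles(pan_tilt(theta_true, phi_true))
angle_err = max(abs(theta_rec - theta_true), abs(phi_rec - phi_true))
print("angle recovery error (radians):", angle_err)
```

A matrix with a significant component of rotation about the optic axis does not fit this model, and the element R[1][0], which is zero for the two-parameter form, gives a quick check.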

Method (1) deals only with the ambiguity between focal length and angle of rotation 
described in Section 3.1. Thus we solve for seven parameters, five for the motion and 
two for the overall scale of the focal lengths, $\beta$ and $\beta'$. The true principal point $(u_0, v_0)$ 
is assumed to be known from the self-calibration. The method is predicated on the 
assumption that rotations are small enough to model the ambiguity between focal length 
and rotation by requiring the true rotation matrix $R_i$ to relate to the measured angles $\theta_i$ 
and $\phi_i$ as 

\[ R_i = R_y(\beta \theta_i) R_x(\beta \phi_i) . \tag{16} \]

Method (2) seeks also to resolve the ambiguity between principal point and motion 
described in Section 3.2, and thus the number of parameters is $7 + 2n + 2n'$ where 
$n$ and $n'$ are the number of images from the two cameras. Method (2) models $R_i$ by 
subtracting the erroneous motion caused by the ambiguity between rotations and motion 
of the principal point, 

R* = Ry (/3 

where $(\hat{u}_0, \hat{v}_0)$ is the measured principal point from self-calibration of a rotating camera 
whereas $(u_0, v_0)$ is its true value. The idea behind method (2) is based on the experimental 
results of Section 3.3, where erroneous motion of the principal point is removed and the 
ambiguity between focal length and motion accounts for the remaining discrepancy from 
the ground truth. 

4.4 Implementation Issues 

Combining the information from two types of input, namely homographies and 
epipolar geometry, in order to provide accurate self-calibration and reconstruction places 
emphasis in methods (1) and (2) on retaining as much information from the initial 
self-calibration as possible. Two important issues are therefore initialization and applying 
priors. 

In our current implementation an initial estimate of $\beta$, and similarly of $\beta'$, is obtained 
by re-solving for only this single parameter in the non-linear self-calibration method. The 
prior is then found by investigating the curvature matrix $J^T J$, where $J$ is the Jacobian. In 
this case $J^T J$ is a $1 \times 1$ matrix. In fact, experiments with a prior chosen more arbitrarily, 
and with $\beta$ and $\beta'$ initialized at unity, also worked well. 

Furthermore, if the principal point was allowed to vary in the initial self-calibration, 
the correction devised in Section 3.3 may be applied to initialize the principal point and 
motion in refinement methods (1) and (2). However, that example used ground truth of 
the principal point in the correction. Since such information is not available here, we 
initialize the principal point either at the centre of the image plane or with that obtained 
from the non-linear self-calibration method where the principal point is maintained at a 
fixed but unknown value throughout the sequence. 

In our experiments we noticed that method (2) converges much more slowly than 
method (1). Therefore we choose only to use method (2) to refine the output from method 
(1). 

4.5 Refining the Solution Using Bundle-Adjustment 

The motion and structure parameters may be refined using a large non-linear 
minimization over all parameters, making use of the sparse form of the Jacobian. The cost function 
for the optimization is the reprojection error over all points and views, 

\[ \sum_{\text{views}} \sum_{\text{points}} \left\| x - K (R \;\; t) X \right\|^2 , \tag{18} \]

which provides a maximum likelihood estimate of the structure and motion. Each point 
$X$ in the structure has three degrees of freedom if it is visible from both cameras, 
or two if it is visible from only a single camera. 
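
A sketch of the reprojection error of equation (18), and of the rms figure derived from it, is given below. The data layout (a dictionary of observations keyed by view and point index) and the toy numbers are assumptions made here for illustration; a real bundle-adjustment would minimize this quantity over all parameters rather than merely evaluate it.

```python
import math

def project(K, R, t, X):
    """Pixel position of the scene point X under P = K (R t)."""
    Xc = [sum(r * c for r, c in zip(row, X)) + ti for row, ti in zip(R, t)]
    h = [sum(k * c for k, c in zip(row, Xc)) for row in K]
    return (h[0] / h[2], h[1] / h[2])

def reprojection_error(views, points, observations):
    """Equation (18): sum of squared image residuals.
    views[v] is a (K, R, t) tuple, points[p] a 3D point, and observations[(v, p)]
    the measured pixel position; unobserved combinations are simply absent."""
    total = 0.0
    for (v, p), (u_obs, v_obs) in observations.items():
        K, R, t = views[v]
        u, w = project(K, R, t, points[p])
        total += (u_obs - u) ** 2 + (v_obs - w) ** 2
    return total

def rms_error(total, n_observations):
    """Root-mean-square residual per observation, in pixels."""
    return math.sqrt(total / n_observations)

# Toy example: one view, one point, measurement displaced by (3, 4) pixels
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
K = [[1000.0, 0.0, 192.0], [0.0, 1000.0, 144.0], [0.0, 0.0, 1.0]]
views = [(K, identity, [0.0, 0.0, 0.0])]
points = [[0.2, 0.1, 2.0]]
u, v = project(K, identity, [0.0, 0.0, 0.0], points[0])
observations = {(0, 0): (u + 3.0, v + 4.0)}

total = reprojection_error(views, points, observations)
rms = rms_error(total, len(observations))
print("summed squared error:", total, " rms:", rms)
```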

Bundle-adjustment is thus guaranteed to reduce the reprojection error, but not 
necessarily the reconstruction error. Of course the reconstruction gained is a valid Euclidean 
one in the sense that the projection matrices have the required form if parameterized as 
$P = K (R \;\; t)$, but it may easily “look” more projective than Euclidean in that angles are 
skew, and length ratios are not preserved correctly. It would be naive to expect 
bundle-adjustment to automatically cope with the inherent ambiguities which are present, the 
more so as it is prone to convergence to local rather than global minima. The parameters 
tend to change only by small amounts, and the final set of parameters differs little from 
the initial estimate. 

5 Experiments and Results 

Experiments were conducted first on simulated data to allow controlled investigation of 
the sensitivity of the reconstruction techniques to varying noise, and varying separation 
of the two cameras. The data were generated so as to correspond roughly with later 
experiments on real imagery. The image sizes were 384 x 288 pixels, and one camera 
had a focal length ranging from 1000 - 1870 pixels and a circular motion in the elevation 
and vergence axes of 4° . The second camera had a longer focal length, 1250-2120 pixels, 
and a smaller circular motion of 3°. The principal point used to generate the data moved 
between frames with an overall motion of approximately 20 pixels. 

The self-calibration and reconstruction algorithms are summarized in Table 2. 

Result 1. The principal result is that a significant improvement can indeed be 
achieved by our method of refining the self-calibration using epipolar geometry. In Figure 
3 we compare both the reconstruction error and the reprojection error as a function of 
image position noise with no refinement of the self-calibration (using single and multiple 
views to calculate $R$ and $t$), and with refinement using method (1) of Section 4.3, which 
only handles the focal length/rotation ambiguity. Priors on the scale factor were obtained 
automatically from the method of Section 4.4. 

Result 2. The performance of the algorithms with varying separation of the two 
cameras is shown in Figure 4. As before, a significant improvement may be obtained 
with our novel methods, especially when the cameras are close together, pointing in a 
similar average direction. 

Result 1 used linear methods of recovering the homographies, the linear 
method of initial self-calibration, and Levenberg-Marquardt for the minimization of the 
refinement cost function, and so is the fastest approach. This second experiment explores 
the other extreme. It uses bundle-adjusted homographies, the non-linear (LM) 
self-calibration, non-linear refinement, and finally bundle-adjusts the entire solution, solving 
for the focal length, principal point and two rotation parameters per camera, assuming 
square pixels. Furthermore, a longer image sequence (30 rather than 20 images) and 
more point correspondences (300 rather than 50 matches between images) were used. 
Again the refined method works much better than non-refined, and adding a final 
bundle-adjustment gives only a small further improvement. Notice that the results from method 
(2), which handles both ambiguities (focal length/rotation and principal point/rotation), 
are virtually indistinguishable from those from method (1), which handles only the former. 

Table 2. The algorithms evaluated in the experiments. 

Algorithm outline 

Description                                                               Label used in keys of graphs 
A. Self-calibrate each camera individually 
B. Compute R and t from a single image pair                               No refinement, single image pair 
   OR compute R and t from multiple image pairs                           No refinement, multiple image pairs 
C. Refine solution from B by resolving focal length/rotation ambiguity    Method (1) 
D. (optional) Refine solution from C by resolving focal length/rotation   Method (2) 
   and principal point/rotation ambiguities 
E. (optional) Bundle-adjustment, initialized at above solution            Bundle-adjustment 

[Figure 3: two panels, each titled "Homographies: linear method. Self-calibration: linear method, known pp".] 

Fig. 3. The reconstruction and reprojection errors for different levels of noise, showing that the 
refinement of the self-calibration using epipolar geometry provides a significant improvement. 
(The angle between the principal directions of the cameras was fixed at 20°.) 

[Figure 4: two panels, each titled "Homographies: Bundle-adjusted. Self-calibration: LM, varying pp"; 
x-axis: average relative angle between cameras (degrees).] 

Fig. 4. The performance of the reconstruction techniques for different relative positions of the two 
cameras. Both cameras perform small rotational motions about some initial direction; the angle 
between the principal directions of the two cameras is plotted on the x-axis. In this experiment 
the noise was constant at $\sigma = 0.5$ pixels. 

Result 3. Figure 5 demonstrates the sensitivity of bundle-adjustment to the initial
estimate. A Euclidean bundle-adjustment is initialized with the output from the initial 
self-calibration, first with varying and then with fixed principal point in the minimiza- 
tion, and also with the output from our refinement method (2). Only small changes in 
parameters occur, and the reduction in reconstruction error is minimal. 

5.1 Real Data 

Two zoom sequences of a point grid were taken with one of the cameras on a stereo
head, using the common elevation and one of the vergence axes to generate the motion.

[Fig. 5 plots. Both panels are titled "Homographies: bundle-adjusted. Self-calibration: LM"; x-axis: average relative angle between cameras (degrees). Legend: LM vary pp, no refinement; Bundle-adj. from LM vary pp; LM fixed pp, no refinement; Bundle-adj. from LM fixed pp; Bundle-adj. from Method (2).]



Fig. 5. A good initial estimate is crucial for bundle-adjustment. Using the present method for 
initialization yields much better results than when the bundle-adjustment is applied directly after 
self-calibration. 











Fig. 6. The first, tenth and last images of the 20 frames of the sequences used from each camera. 
That the sequences were taken from viewpoints very close together is reflected in the similarity 
between the sequences. 



Since we know the structure of the grid, we may measure results accurately, and the 
quality of the ensuing reconstruction is easily visualized. In the first sequence the focal 
length was varied between 1400 and 800 pixels (i.e. zooming out) with a circular motion 
of half-cone 2.5°. In the second the focal length decreased from 1700 to 1100 pixels, 
and the circular motion was 2°. Between the sequences the head was moved to provide a 
finite baseline. The angle between the scene and the two optic centres was approximately 
10°. The first, tenth and last images from a 20 image sequence are shown in Figure 6.
The motion in these sequences is very small, and the initial self-calibration was
found to vary considerably depending on which algorithms were used to calculate the
homographies and self-calibration, and how many images were used. Results from three
experiments are summarized in Table 3, and Figure 7 shows reconstructions of the scene
with and without refinement. Again, the novel methods presented in this paper provide
a very significant improvement. 



6 Conclusions 

In this paper we have shown how systematic inaccuracies in the self-calibration of 
rotating cameras apparent in H can be accounted for by the ambiguities inherent in 
rotating motion fields. These effects are particularly keenly felt when small motions, 
large focal lengths, short image sequences and a poor spread of image features are 
involved.

(a) (b) (c)

Fig. 7. Plan views of the reconstructed scene. (a) represents the fourth row of Table 3, using
homographies calculated from the linear method, and the relative motion from a single image pair
with no refinement. (b) demonstrates refinement method (1) applied to this reconstruction (row 6
in the table). (c) was obtained using bundle-adjusted homographies and a final bundle-adjustment
of the motion and structure on the output from method (2) (row 10 in the table).




Table 3. Results of reconstruction of the calibration grid.

No. of   Homography    Self-calibration   Refinement method          Angle between   Reconstruction   Reprojection
images   calculation   algorithm                                     planes          error (%)        error (pixels)
5        Linear        Linear, known pp   Single image, no ref.      53.0            81.7             0.544
5        Linear        Linear, known pp   Multiple images, no ref.   52.7            79.0             0.489
5        Linear        Linear, known pp   Method (1)                 109.7           19.9             0.298
8        Linear        Linear, known pp   Single image, no ref.      33.3            110.2            0.422
8        Linear        Linear, known pp   Multiple images, no ref.   93.4            8.3              0.316
8        Linear        Linear, known pp   Method (1)                 95.7            8.0              0.259
20       Bundle-adj.   LM, varying pp     Single image, no ref.      98.8            9.2              0.336
20       Bundle-adj.   LM, varying pp     Method (1)                 88.1            3.2              0.349
20       Bundle-adj.   LM, varying pp     Method (2)                 88.8            3.0              0.361
20       Bundle-adj.   LM, varying pp     Bundle-adj.                90.6            2.8              0.327



The paper has also demonstrated that the epipolar geometry between multiple rotating 
cameras can and should be exploited to refine the initial self-calibration of the sets of 
intrinsic parameters, and hence to improve recovered scene structure. The improvements 
can be substantial. 

By experiment, it has been shown too that, by itself, a Euclidean bundle-adjustment 
cannot resolve the ambiguities. Methods such as those presented here are required to 
initialize the adjustment. Interestingly, especially for those concerned with on-line time- 
sensitive implementations, the initialized position is often good enough for bundle- 
adjustment to make rather little improvement. In current work we are exploring the 
reduction of the parameters in bundle-adjustment just to those which appear poorly 
estimated from independent self-calibration of each camera. However it appears that the 
cost function surface of the reprojection error is still peppered with local minima. 

The ambiguity between focal length and rotation is apparent as the bas-relief ambi- 
guity in sequences with general motion, and can rapidly lead to disastrous results O. 
The reconstructed scene appears skewed relative to the true configuration, and length 
ratios are not preserved, implying that the upgrade from projective to Euclidean structure 
has not been successful. This is precisely the kind of behaviour we observe here (e.g.
Fig. 7(a)) with reconstructions from multiple rotating cameras if the ambiguity between
focal length and rotation is not resolved. It is found that the near ambiguity between
principal point and motion does not have as great an impact on the resulting Euclidean
reconstruction, and resolving its effects is more difficult.

In future work we intend to extend the analysis to multi-focal constraints. This has 
the added benefit of better constrained matching.

Abstract. In this paper it is shown how to perform hand-eye calibration
using only the normal flow field and knowledge about the motion of the
hand. The proposed method comprises a simple way to calculate the hand-eye
calibration when a camera is mounted on a robot. Firstly, it is shown
how the orientation of the optical axis can be estimated from at least 
two different translational motions of the robot. Secondly, it is shown 
how the other parameters can be obtained using at least two different 
motions containing also a rotational part. In both stages, only image 
gradients are used, i.e. no point matches are needed. As a by-product, 
both the motion field and the depth of the scene can be obtained. The 
proposed method is illustrated in experiments using both simulated and 
real data. 



1 Introduction 

Computer vision and autonomous systems have been fields of active research
in recent years. One interesting application is to use computer
vision techniques to help autonomous vehicles perform their tasks. In this
paper we are aiming at an application within robotics, more specifically, using
a camera mounted on the robot arm to aid the robot in performing different 
tasks. When a camera is mounted by an operator onto the robot arm, it cannot 
be assumed that the exact location of the camera with respect to the robot is 
known, since different cameras and mounting devices can be used. It might even 
be necessary to have a flexible mounting device in order to be able to perform 
a wide variety of tasks. This problem, called hand-eye calibration, will be dealt 
with in this paper. 

Hand-eye calibration is an important task due to (at least) two applications.
Firstly, in order to have the vision system guide the robot to, for example,
grasp objects, it is essential to know the orientation of the different coordinate
systems. Secondly, when the robot is looking at an object and it is necessary to take
an image from a different viewpoint, the hand-eye calibration is again necessary.
We will throughout the paper assume that the robot-hand calibration is known,
which implies that the relation between the robot coordinate system and the
hand coordinate system is known. This assumption implies that we may take
advantage of the possibility to move the hand of the robot in any predetermined
way with respect to the robot coordinate system. In fact, this possibility will
be used as a key ingredient in the proposed method for hand-eye calibration.
We will furthermore assume that the camera is calibrated, i.e. that the intrinsic
parameters are known. The problem of recovering both the hand-eye calibration
and the robot-hand calibration has been treated in [3, 13, 12].

* This work has been supported by the Swedish Research Council for Engineering
Sciences (TFR), project 95-64-222.

Hand-eye calibration has been treated by many researchers, e.g. [10, 11, 2, 4]. 
The standard approach relies on (i) a known reference object (calibration object) 
and (ii) the possibility to reliably track points on this reference object in order 
to obtain corresponding points between pairs of images. This approach leads to 
the study of the equation AX = XB, where A, X and B denote 4 × 4 matrices
representing Euclidean transformations. A and B denote the transformations 
between the first and second position of the robot hand (in the robot coordinate 
system) and the camera (in the camera coordinate system - estimated from 
point correspondences) respectively and X denotes the transformation between 
the hand coordinate system and the camera coordinate system, i.e. the hand-eye 
calibration. 
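
As an illustration of the equation AX = XB, the following minimal sketch (not taken from the cited works) builds a hand-eye transformation X and a camera motion B, derives the hand motion A = X B X^{-1} that is consistent with them, and checks that both sides of the equation agree. The helper names `matmul`, `rigid` and `inverse_rigid`, as well as all numerical values, are hypothetical and chosen only for this example.

```python
import math

def matmul(a, b):
    """Product of two 4x4 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def rigid(axis, angle_deg, t):
    """4x4 Euclidean transformation: rotation about a coordinate axis plus translation t."""
    c = math.cos(math.radians(angle_deg))
    s = math.sin(math.radians(angle_deg))
    rot = {
        "x": [[1, 0, 0], [0, c, -s], [0, s, c]],
        "y": [[c, 0, s], [0, 1, 0], [-s, 0, c]],
        "z": [[c, -s, 0], [s, c, 0], [0, 0, 1]],
    }[axis]
    return [rot[i] + [t[i]] for i in range(3)] + [[0, 0, 0, 1]]

def inverse_rigid(m):
    """Inverse of a Euclidean transformation: (R, t) -> (R^T, -R^T t)."""
    rt = [[m[j][i] for j in range(3)] for i in range(3)]
    ti = [-sum(rt[i][k] * m[k][3] for k in range(3)) for i in range(3)]
    return [rt[i] + [ti[i]] for i in range(3)] + [[0, 0, 0, 1]]

X = rigid("x", 30.0, [0.10, -0.05, 0.20])   # hypothetical hand-eye transformation
B = rigid("z", 15.0, [0.00, 0.03, 0.01])    # hypothetical camera motion
A = matmul(matmul(X, B), inverse_rigid(X))  # hand motion consistent with X and B

lhs, rhs = matmul(A, X), matmul(X, B)
max_diff = max(abs(lhs[i][j] - rhs[i][j]) for i in range(4) for j in range(4))
```

In a real calibration A and B are measured and X is the unknown; the sketch only shows what the constraint expresses.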

The hand-eye calibration problem can be simplified considerably by using 
the possibility to move the robot in a controlled manner. In [8] this fact has 
been exploited by first only translating the camera in order to obtain the rotational
part of the hand-eye transformation and then making motions containing
also a rotational part in order to obtain the translational part of the hand-eye
calibration. This approach also makes it possible to cope without the calibration
grid, but it is necessary to be able to detect and track points in the surrounding
world.

We will go one step further and solve the hand-eye calibration problem without
using any point correspondences at all. Instead we will use the normal flow
(i.e. the projection of the motion field along the direction of the image
gradient, obtained directly from the image derivatives) and the possibility to
make controlled motions of the robot. We will also proceed in two steps: (i) making
(small) translational motions in order to estimate the rotational part of the
hand-eye calibration and (ii) making motions containing also a rotational part
in order to estimate the translational part. We will show that it is sufficient, at
least theoretically, to use only two translational motions and two motions also
containing a rotational part. The idea to use only the normal flow (instead of an
estimate of the optical flow obtained from the optical flow constraint equation
and a smoothness constraint) has been used in [1] to perform (intrinsic) calibration
of a camera. In this paper we will use this approach to perform hand-eye (extrinsic)
calibration.

Our method for hand-eye calibration boils down to recovering the motion of 
the camera, using only image derivatives, when we have knowledge about the 
motion of the robot hand. This work has a lot in common with the work by 
Horn and Weldon [6] and Negahdaripour and Horn [9], where the same kind of 
intensity constraints are developed. We use, however, an active approach which 
allows us to choose the type of motions so that, for example, the unknown depth
parameter Z can be effectively eliminated. The equations are developed with the
goal of a complete hand-eye calibration in mind and so that they effectively use
the information obtained in the preceding steps of the algorithm.

The paper is organized as follows. In Section 2 a formal problem formulation 
will be given together with some notations. The hand-eye calibration problem 
will be solved in Section 3, where the estimate of the rotational part will be 
given in Section 3.1 and the estimate of the translational part will be given in 
Section 3.2. Some preliminary experimental results on both synthetic and real 
images will be given in Section 4 and some conclusions and directions of further 
research will be given in Section 5. 



2 Problem Formulation 



Throughout this paper we represent the coordinates of a point in the image
plane by small letters (x, y) and the coordinates in the world coordinate frame
by capital letters (X, Y, Z). In our work we use the pinhole camera model as our
projection model. That is, the projection is governed by the following equation,
where the coordinates are expressed in homogeneous form,

$$
\lambda \begin{bmatrix} x \\ y \\ 1 \end{bmatrix}
= \begin{bmatrix} \gamma f & s f & x_0 \\ 0 & f & y_0 \\ 0 & 0 & 1 \end{bmatrix}
\begin{bmatrix} R \mid -Rt \end{bmatrix}
\begin{bmatrix} X \\ Y \\ Z \\ 1 \end{bmatrix} . \tag{1}
$$



Here, $f$ denotes the focal length, $\gamma$ and $s$ the aspect ratio and the skew, and
$(x_0, y_0)$ the principal point. These are called the intrinsic parameters. Furthermore,
R and t denote the relation between the camera coordinate system and
the object coordinate system, where R denotes a rotation matrix and t a translation
vector, i.e. a Euclidean transformation. These are called the extrinsic
parameters.

In this study of hand-eye calibration we assume that the camera is calibrated,
i.e. that the intrinsic parameters are known, and that the image coordinates of
the camera have been corrected for the intrinsic parameters. This means that
the camera equation can be written as in (1) with $f = 1$, $\gamma = 1$, $s = 0$ and
$(x_0, y_0) = (0, 0)$. With these parameters the projection simply becomes

$$
x = \frac{X}{Z}, \qquad y = \frac{Y}{Z}, \tag{2}
$$

where the object coordinates have been expressed in the camera coordinate system.

The hand-eye calibration problem boils down to finding the transformation 
H = (R, t) between the robot hand coordinate system and the camera coordinate
system, see Figure 1. In the general case this transformation has 6 degrees of
freedom, 3 for the position, defined by the 3-vector t, and 3 for the orientation,
defined by the orthogonal matrix R. We will solve these two parts separately,
starting with the orientation.




H = (R,t) 



Fig. 1. The relation between the robot coordinate system and the camera coordinate 
system. 



To find the orientation of the camera we will calculate the direction
$D = (D_x, D_y, D_z)$, in the camera coordinate system, of at least two known translations
of the robot hand, in the robot hand coordinate system. The relation
between these directions in the two different coordinate systems will give us 
the orientation, R, between the two coordinate systems in all the 3 degrees of 
freedom. 

For the position we would like to find the translation $T = (T_x, T_y, T_z)$
between the robot hand coordinate system and the camera coordinate system as 
seen in the robot coordinate system. Translation of the robot hand coordinate 
system will not give any information about T, as Ma also pointed out in [8]. 
Therefore, we will instead use rotations to find T. The procedure for this will be 
fully explained below in Section 3.2. 

The main goal of our approach is to be able to do a complete hand-eye 
calibration without at any point having to extract any features and match them 
between images. To this end we use the notion of normal flow. The normal flow 
is the apparent flow of intensities in the image plane through an image sequence, 
i.e. the orthogonal projection of the motion field onto the image gradient. We will 
look at only two subsequent images in the current method. We will below briefly 
derive the so called optical flow constraint equation which is the cornerstone in 
our method. Note that our usage of the term normal flow means the motion
of the intensity patterns, defined by the spatial and temporal image intensity
derivatives, $E_x$, $E_y$ and $E_t$, and is not the same as the estimate of the motion field
obtained from the optical flow constraint equation and a smoothness constraint.

If we think of an image sequence as a continuous stream of images, let
$E(x, y, t)$ be the intensity at point $(x, y)$ in the image plane at time $t$. Let $u(x, y)$
and $v(x, y)$ denote components of the motion field in the $x$ and $y$ directions
respectively. Using the constraint that the gray-level intensity of the object is
(locally) invariant to the viewing angle and distance we expect the following
equation to be fulfilled:



$$
E(x + u\,\delta t,\; y + v\,\delta t,\; t + \delta t) = E(x, y, t) . \tag{3}
$$



That is, the intensity at time $t + \delta t$ at the point $(x + u\,\delta t,\, y + v\,\delta t)$ will be the same as
the intensity at $(x, y)$ at time $t$. If we assume that the brightness varies smoothly
with $x$, $y$ and $t$ we can expand (3) in a Taylor series giving

$$
E(x, y, t) + u\,\delta t\,\frac{\partial E}{\partial x} + v\,\delta t\,\frac{\partial E}{\partial y} + \delta t\,\frac{\partial E}{\partial t} + e = E(x, y, t) . \tag{4}
$$

Here, $e$ is the error term of order $O((\delta t)^2)$. By cancelling out $E(x, y, t)$, dividing
by $\delta t$ and taking the limit as $\delta t \to 0$ we obtain

$$
E_x u + E_y v + E_t = 0 , \tag{5}
$$

where

$$
u = \frac{dx}{dt}, \qquad v = \frac{dy}{dt}, \tag{6}
$$

which denotes the motion field. Equation (5) is often used together with a
smoothness constraint to make an estimation of the motion field, see e.g. [5].
In our method we will, however, not need to use such a smoothness constraint.
Instead we will use what we know about the current motion and write down
expressions for $u(x, y)$ and $v(x, y)$. By constraining the flow $(u, v)$ in this way
we will be able to write down linear systems of equations of the form (5) which
we can solve for the unknowns contained in the expressions for $u$ and $v$. These
unknowns will be shown to be in direct correspondence to the vectors D and T
above, that is, to the unknowns of the hand-eye transformation.
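
To make the role of the optical flow constraint equation concrete, the sketch below computes the normal flow at a single pixel, i.e. the only component of the motion field that the constraint E_x u + E_y v + E_t = 0 determines without further assumptions. The function name `normal_flow` and the sample values are hypothetical; they are not part of the method described in this paper.

```python
def normal_flow(ex, ey, et):
    """Component of the motion field along the image gradient, from E_x u + E_y v + E_t = 0.

    Returns None where the gradient vanishes, since the constraint carries
    no information about the flow at such a pixel.
    """
    g2 = ex * ex + ey * ey
    if g2 == 0.0:
        return None
    scale = -et / g2
    return (scale * ex, scale * ey)

# Hypothetical pixel: image gradient (2, 1) and true flow (0.3, -0.1).
ex, ey = 2.0, 1.0
u, v = 0.3, -0.1
et = -(ex * u + ey * v)           # temporal derivative implied by the constraint
un, vn = normal_flow(ex, ey, et)  # projection of (u, v) onto the gradient direction
```

The recovered vector satisfies the constraint but differs from the true flow (u, v): the component perpendicular to the gradient is not observable from a single pixel.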



3 Hand-Eye Calibration 



In this section we explain our method in detail, in two steps. We start
with the orientation part of the transformation H. This is
crucial since the result of this part will be used to simplify the calculations of
the position part.

3.1 The Orientation of the Camera

The orientation of the camera will be obtained by translating the robot hand 
in at least two known separate directions in the robot hand coordinate system. 
As mentioned above we aim at stating expressions for the field components $u$
and $v$. These expressions will contain the unknown direction D in the camera
coordinate system of the current translation. To make the procedure as easy as
possible we will choose the directions of the translations so that they coincide
with two of the axes of the robot hand coordinate system. Then we know that
the perceived directions in the camera coordinate system correspond exactly to these
two axes in the robot hand coordinate system. The direction of the third axis can
be obtained from the vector product or by making a third translation.

Following a way of expressing the motion field explained also in e.g. [7] and
[5], we will start by expressing the motion of a point $P = (X, Y, Z)$ in the
object coordinate system, which coincides with the camera coordinate system.
Let $V = (\dot{X}, \dot{Y}, \dot{Z})$ denote the velocity of this point.

Now we translate the robot hand along one axis in the robot hand coordinate
system. Let $D = (D_x, D_y, D_z)$ denote the translation of the point P in the camera
coordinate system. Then

$$
(\dot{X}, \dot{Y}, \dot{Z}) = -(D_x, D_y, D_z) . \tag{7}
$$

The minus sign appears because we actually model a motion
of the camera, but instead move the points in the world. The projection of the
point P is governed by the equations in (2). Differentiating these equations with
respect to time and using (6) we obtain

$$
\begin{aligned}
u &= \dot{x} = \frac{\dot{X}}{Z} - \frac{X \dot{Z}}{Z^2} = -\frac{1}{Z}(D_x - x D_z) , \\
v &= \dot{y} = \frac{\dot{Y}}{Z} - \frac{Y \dot{Z}}{Z^2} = -\frac{1}{Z}(D_y - y D_z) ,
\end{aligned} \tag{8}
$$

where the projection equations (2) and (7) have been used. Equation (8)
gives the projected motion field in the image plane expressed in the translation
D given in the camera coordinate system. This is the motion field that we will
use together with the optical flow constraint equation (5).

Inserting (8) in (5) gives, after multiplication with Z,

$$
-E_x D_x - E_y D_y + (E_x x + E_y y) D_z + E_t Z = 0 . \tag{9}
$$

The spatial derivatives are here taken in the first image, before the motion, so
they will not depend on the current translation D. $E_t$, on the other hand, will
depend on D. Let N denote the number of pixels in the image. We obtain one
equation of the form (9) for every pixel in the image. If we look at the unknowns
of this set of equations, we have one unknown depth parameter, Z, for each pixel
and equation, but the translation parameters $(D_x, D_y, D_z)$ are the same in every
pixel. Therefore the linear system of equations of the form (9) taken over the
whole image has N equations and N + 3 unknowns. Let A denote the system
matrix of this system of equations.
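
The relations (2), (7), (8) and (9) can be checked numerically at a single pixel. The following sketch is only an illustration under assumed values: the function names, the 3-D point, the translation and the image gradient are all hypothetical. It compares the motion field (8) with a finite-difference projection of the moving point, and confirms that derivatives consistent with (5) give a zero residual in (9) and the correct depth.

```python
def project(p):
    """Normalized pinhole projection (2): x = X/Z, y = Y/Z."""
    return (p[0] / p[2], p[1] / p[2])

def translational_flow(x, y, z, d):
    """Motion field (8) at image point (x, y) with depth z for the translation d."""
    return (-(d[0] - x * d[2]) / z, -(d[1] - y * d[2]) / z)

def residual_9(ex, ey, et, x, y, z, d):
    """Left-hand side of (9); zero when the derivatives agree with d and z."""
    return -ex * d[0] - ey * d[1] + (ex * x + ey * y) * d[2] + et * z

def depth_from_9(ex, ey, et, x, y, d):
    """Depth obtained from (9) once d is known; undefined where E_t = 0."""
    return (ex * d[0] + ey * d[1] - (ex * x + ey * y) * d[2]) / et

# Hypothetical point, translation and image gradient.
P = (1.0, -2.0, 10.0)
D = (0.2, 0.1, -0.3)
x, y = project(P)
u, v = translational_flow(x, y, P[2], D)

# Finite-difference check of (8): the point moves with velocity -D, as in (7).
h = 1e-6
x2, y2 = project(tuple(P[i] - h * D[i] for i in range(3)))
fd_u, fd_v = (x2 - x) / h, (y2 - y) / h

ex, ey = 0.7, -0.4
et = -(ex * u + ey * v)   # temporal derivative implied by (5)
```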

To obtain more equations of the form (9) but only 3 more unknowns we make
another translation, now in a direction along another axis in the robot hand
coordinate system. Let $\bar{D} = (\bar{D}_x, \bar{D}_y, \bar{D}_z)$ denote this new translation direction in
the camera coordinate system. Then the new equations of the form (9) become

$$
-E_x \bar{D}_x - E_y \bar{D}_y + (E_x x + E_y y) \bar{D}_z + \bar{E}_t Z = 0 . \tag{10}
$$

Notice that $E_x$ and $E_y$ are the same as in (9) but $E_t$ has changed to $\bar{E}_t$. The
depth parameters $Z(x, y)$ in (10), i.e. the depths at the points $(x, y)$ in the original
reference image, are the same as in the equations (9) resulting from the first motion.
The only new unknowns are the components of the new translation vector $\bar{D} = (\bar{D}_x, \bar{D}_y, \bar{D}_z)$.
Let $\bar{A}$ be the system matrix of this new system of equations resulting from the
second translation. Put together the equations from system $A$ and system $\bar{A}$ in
a new linear system M. This system will now have 2N equations but only N + 6
unknowns.

Primarily, we are only interested in the unknowns $D$ and $\bar{D}$. To make the
system M more stable with respect to these unknowns, and also in order to
reduce the number of equations, we will eliminate Z in M. Pair together the
equations that correspond to the same pixel in the first reference image. Then
Z can be eliminated from (9) and (10), by multiplying (9) by $\bar{E}_t$, multiplying (10)
by $E_t$ and subtracting, giving

$$
-E_x \bar{E}_t D_x - E_y \bar{E}_t D_y + (E_x x + E_y y) \bar{E}_t D_z
+ E_x E_t \bar{D}_x + E_y E_t \bar{D}_y - (E_x x + E_y y) E_t \bar{D}_z = 0 . \tag{11}
$$

Taking the equations of the form (11) for each pixel in the first image, a new linear
system M' is obtained with N equations and only the 6 direction components as
unknowns. Observe that the estimates of the directions $D$ and $\bar{D}$ give the known
directions of the robot hand, $v$ and $\bar{v}$, in the camera coordinate system.
The rotational part of the hand-eye calibration can easily be obtained from the
fact that it maps the directions $v$ and $\bar{v}$ to the directions $D$ and $\bar{D}$ and is
represented by an orthogonal matrix.
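
One possible way to obtain the orthogonal matrix from two direction correspondences is sketched below. This is an assumption on our part, not necessarily the construction used in the experiments: an orthonormal triad is built from each pair of directions, and the rotation is the matrix that maps one triad onto the other. The function names are hypothetical, and the sketch assumes that the estimated directions are available with the correct sign, since a homogeneous system such as M' determines them only up to scale.

```python
import math

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def normalize(a):
    n = math.sqrt(sum(c * c for c in a))
    return tuple(c / n for c in a)

def triad(a, b):
    """Right-handed orthonormal basis built from two non-parallel vectors."""
    e1 = normalize(a)
    e3 = normalize(cross(a, b))
    e2 = cross(e3, e1)
    return (e1, e2, e3)

def rotation_from_directions(v, v_bar, d, d_bar):
    """Orthogonal matrix R with R v parallel to d and R v_bar parallel to d_bar."""
    h = triad(v, v_bar)   # basis in the robot hand frame
    c = triad(d, d_bar)   # corresponding basis in the camera frame
    return [[sum(c[k][i] * h[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def apply(r, a):
    return tuple(sum(r[i][j] * a[j] for j in range(3)) for i in range(3))

# Hypothetical ground truth: rotation of 30 degrees about the z-axis.
ang = math.radians(30.0)
R_true = [[math.cos(ang), -math.sin(ang), 0.0],
          [math.sin(ang), math.cos(ang), 0.0],
          [0.0, 0.0, 1.0]]
v, v_bar = (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)          # translations along two hand axes
d = tuple(2.5 * c for c in apply(R_true, v))          # estimated directions, arbitrary scale
d_bar = tuple(0.4 * c for c in apply(R_true, v_bar))
R = rotation_from_directions(v, v_bar, d, d_bar)
```

With noisy direction estimates the two triads are no longer exactly congruent, and a least-squares fit over all available directions would probably be preferable to this direct construction.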

Another approach would be to use also a third translational motion of the 
robot hand, resulting in three different equations like (9) containing 9 different 
translation parameters and N depths. Eliminating the depths for each pixel gives 
two linearly independent equations in the 9 translational parameters for each 
pixel, i.e. in total 2N equations in 9 parameters. Observe also that an estimate 
of the depth of the scene can be obtained by inserting the estimated translation 
parameters into the constraints (9). Note that for pixels where $E_t = 0$ the depths
cannot be estimated from (9) since the coefficient for Z is equal to 0.



3.2 The Position of the Camera 

To find the position of the camera in relation to the robot hand, rotational 
motions have to be used. Pure translational motions will not give any information
about the translation between the robot hand and the camera. We will now write
down the motion field $(u(x, y), v(x, y))$ for the rotational case in a similar fashion
as we did for the translational case.

Describe the motion of the points in the camera coordinate system by a
rotation around an axis that does not have to cross the focal point of the camera.
Let $\Omega = (\Omega_x, \Omega_y, \Omega_z)$ denote the direction of this axis of rotation and
$P = (X, Y, Z)$ the coordinates of a point in the camera coordinate system. Furthermore,
let the translation between the origin of the robot hand coordinate
system and the focal point be described by the vector $T = (T_x, T_y, T_z)$, see
Figure 2. This is the vector that we want to calculate.



Fig. 2. Rotation around the direction $\Omega = (0, 0, 1)$. The X-axis is pointing out from
the picture. The orientation of the camera is here only a rotation around the X-axis, for
simplicity. The dashed coordinate systems correspond to the first and second centers
of the rotations.



Let the axis of rotation cross the origin of the robot hand coordinate system.
Then the velocity $V = (\dot{X}, \dot{Y}, \dot{Z})$ of the point P in the camera coordinate system
resulting from a rotation around this axis will be equal to $-\Omega \times (P - T)$, i.e.

$$
\begin{aligned}
\dot{X} &= -\Omega_y (Z - T_z) + \Omega_z (Y - T_y) = -\Omega_y Z + \Omega_z Y + \Omega_y T_z - \Omega_z T_y , \\
\dot{Y} &= -\Omega_z (X - T_x) + \Omega_x (Z - T_z) = -\Omega_z X + \Omega_x Z + \Omega_z T_x - \Omega_x T_z , \\
\dot{Z} &= -\Omega_x (Y - T_y) + \Omega_y (X - T_x) = -\Omega_x Y + \Omega_y X + \Omega_x T_y - \Omega_y T_x .
\end{aligned} \tag{12}
$$

Here, the equations have been rewritten in the form of a rotation around an axis
that crosses the focal point plus a term that can be interpreted as a translation
in the camera coordinate system. Using the velocity vector (12) and equations
(2) and (8) the motion field now becomes

$$
\begin{aligned}
u &= \Omega_x xy - (1 + x^2) \Omega_y + \Omega_z y + x \Omega_y \frac{T_x}{Z} - (x \Omega_x + \Omega_z) \frac{T_y}{Z} + \Omega_y \frac{T_z}{Z} , \\
v &= \Omega_x (1 + y^2) - \Omega_y xy - \Omega_z x + (\Omega_z + y \Omega_y) \frac{T_x}{Z} - y \Omega_x \frac{T_y}{Z} - \Omega_x \frac{T_z}{Z} .
\end{aligned} \tag{13}
$$

This field will be plugged into the optical flow constraint equation (5). Since we
know the axis of rotation, the first three terms of the equations for u and v are
known. Let

$$
E'_t = E_t + E_x \left( \Omega_x xy - \Omega_y (1 + x^2) + \Omega_z y \right) + E_y \left( \Omega_x (1 + y^2) - \Omega_y xy - \Omega_z x \right) . \tag{14}
$$

Here, $E'_t$ is a known quantity and the motion field (13) plugged into (5) can be
written as (after multiplication with Z)

$$
A T_x + B T_y + C T_z + Z E'_t = 0 , \tag{15}
$$

where

$$
\begin{aligned}
A &= E_x \Omega_y x + E_y (\Omega_z + y \Omega_y) , \\
B &= -E_x (x \Omega_x + \Omega_z) - E_y \Omega_x y , \\
C &= E_x \Omega_y - E_y \Omega_x .
\end{aligned} \tag{16}
$$

This resembles the equations obtained in Section 3.1. We will also here eliminate
the depth Z by choosing a new motion. A difference is that in this case we are
not only interested in the direction of the vector T, but also the length of T.
To be able to calculate this, the center of rotation will be moved away from the
origin of the robot hand by a known displacement $e = (e_x, e_y, e_z)$ and a new rotation
axis $\bar{\Omega}$ is chosen. This leads to a new equation of the form (15)

$$
\bar{A} (T_x + e_x) + \bar{B} (T_y + e_y) + \bar{C} (T_z + e_z) + Z \bar{E}'_t = 0 . \tag{17}
$$

Here, $\bar{A}$, $\bar{B}$, $\bar{C}$ and $\bar{E}'_t$ correspond to (16) and (14) for the new rotation axis $\bar{\Omega}$.
Pairing together (15) and (17) and eliminating the depth Z gives

$$
(\bar{E}'_t A - E'_t \bar{A}) T_x + (\bar{E}'_t B - E'_t \bar{B}) T_y + (\bar{E}'_t C - E'_t \bar{C}) T_z = E'_t (\bar{A} e_x + \bar{B} e_y + \bar{C} e_z) . \tag{18}
$$

We have one equation of this kind for each of the N pixels in the image and only
3 unknowns. Solving the over-determined linear equation system using least-squares
gives us the translational part of the hand-eye transformation in the
camera coordinate system. It is then a simple task to transfer this translation
to the robot hand coordinate system if wanted.

The direction of the second rotation axis $\bar{\Omega}$ does not necessarily have to be
different from the first, $\Omega$. For example, choosing $\Omega = \bar{\Omega} = (0, 0, 1)$, i.e. the
rotation axis is parallel to the optical axis, we get

$$
\begin{aligned}
A &= \bar{A} = E_y , \\
B &= \bar{B} = -E_x , \\
C &= \bar{C} = 0 .
\end{aligned} \tag{19}
$$

Equation (18) then reduces to

$$
E_y (\bar{E}'_t - E'_t) T_x - E_x (\bar{E}'_t - E'_t) T_y = E'_t (E_y e_x - E_x e_y) , \tag{20}
$$

where

$$
\begin{aligned}
E'_t &= E_t + E_x y - E_y x , \\
\bar{E}'_t &= \bar{E}_t + E_x y - E_y x .
\end{aligned} \tag{21}
$$

This equation system gives a way of calculating $T_x$ and $T_y$, but $T_z$ is lost and
must be calculated in another manner. To calculate $T_z$ we can, for example,
instead choose $\Omega = \bar{\Omega} = (1, 0, 0)$ which gives a linear system in $T_y$ and $T_z$.
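
The least-squares step for the position can be sketched as follows. This is a minimal illustration, not the implementation used in the experiments: it stacks one equation of the form (18) per pixel and solves the normal equations with a small Gaussian elimination, whereas a QR- or SVD-based solver would usually be more robust. All function names and the synthetic, noise-free data are hypothetical; two different rotation axes are used so that all three components of T are constrained.

```python
import random

def abc(ex, ey, x, y, omega):
    """Coefficients A, B, C of (16) for one pixel and rotation axis omega."""
    ox, oy, oz = omega
    a = ex * oy * x + ey * (oz + y * oy)
    b = -ex * (x * ox + oz) - ey * ox * y
    c = ex * oy - ey * ox
    return (a, b, c)

def solve3(m, r):
    """Solve a 3x3 linear system by Gaussian elimination with partial pivoting."""
    a = [list(m[i]) + [r[i]] for i in range(3)]
    for col in range(3):
        piv = max(range(col, 3), key=lambda k: abs(a[k][col]))
        a[col], a[piv] = a[piv], a[col]
        for k in range(col + 1, 3):
            f = a[k][col] / a[col][col]
            for j in range(col, 4):
                a[k][j] -= f * a[col][j]
    t = [0.0, 0.0, 0.0]
    for i in (2, 1, 0):
        t[i] = (a[i][3] - sum(a[i][j] * t[j] for j in range(i + 1, 3))) / a[i][i]
    return t

def estimate_t(pixels, omega, omega_bar, e):
    """Least-squares solution of the stacked equations (18) via the normal equations.

    pixels: iterable of (ex, ey, x, y, et1, et2), where et1 and et2 are the
    known quantities of the form (14) for the first and second rotation.
    """
    mtm = [[0.0] * 3 for _ in range(3)]
    mtd = [0.0] * 3
    for ex, ey, x, y, et1, et2 in pixels:
        a, b, c = abc(ex, ey, x, y, omega)
        ab, bb, cb = abc(ex, ey, x, y, omega_bar)
        row = (et2 * a - et1 * ab, et2 * b - et1 * bb, et2 * c - et1 * cb)
        rhs = et1 * (ab * e[0] + bb * e[1] + cb * e[2])
        for i in range(3):
            mtd[i] += row[i] * rhs
            for j in range(3):
                mtm[i][j] += row[i] * row[j]
    return solve3(mtm, mtd)

def synthetic_pixels(t, e, omega, omega_bar, n=500, seed=0):
    """Noise-free data generated from (15) and (17) with random gradients and depths."""
    rng = random.Random(seed)
    out = []
    for _ in range(n):
        ex, ey = rng.uniform(-1, 1), rng.uniform(-1, 1)
        x, y = rng.uniform(-1, 1), rng.uniform(-1, 1)
        z = rng.uniform(5, 15)
        a, b, c = abc(ex, ey, x, y, omega)
        ab, bb, cb = abc(ex, ey, x, y, omega_bar)
        et1 = -(a * t[0] + b * t[1] + c * t[2]) / z
        et2 = -(ab * (t[0] + e[0]) + bb * (t[1] + e[1]) + cb * (t[2] + e[2])) / z
        out.append((ex, ey, x, y, et1, et2))
    return out

T_true = (0.10, -0.20, 0.30)
e = (0.50, 0.00, 0.20)
omega, omega_bar = (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)
T_est = estimate_t(synthetic_pixels(T_true, e, omega, omega_bar), omega, omega_bar, e)
```

On this noise-free data the estimate should agree with `T_true` up to rounding; with real image derivatives the result depends on how well the derivatives are approximated.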

4 Experiments 

In this section the hand-eye calibration algorithm is tested in practice on both 
synthetic and real data. On the synthetic sequence the noise sensitivity of the 
method is examined. 



4.1 Synthetic Data 



A synthetic image E was constructed using a simple raytracing routine. The 
scene consists of the plane Z+^ = 10 in camera coordinate system. The texture 
on the plane is described by $I(X, Y) = \sin(X) + \sin(Y)$ in a coordinate system
of the plane with origin at $O = (0, 0, 10)$ in the camera coordinate system, see
Figure 3. The extension of the image plane is from -1 to 1 in both the X and Y
directions. The image is discretized using a step-size of $\delta x = 0.01$ and $\delta y = 0.01$,
so that the number of pixels is equal to $N = 201 \times 201$.

The spatial derivatives, $E_x$ and $E_y$, have in the experiments been calculated
by convolution with the derivatives of a Gaussian kernel. That is, $E_x = E * G_x$
and $E_y = E * G_y$, where

$$
G_x = -\frac{x}{2\pi\sigma^4}\, e^{-\frac{x^2 + y^2}{2\sigma^2}} , \qquad
G_y = -\frac{y}{2\pi\sigma^4}\, e^{-\frac{x^2 + y^2}{2\sigma^2}} . \tag{22}
$$

Fig. 3. An image from the computer generated sequence



The temporal derivatives were calculated by simply taking the difference of
the intensity in the current pixel between the first and the second image. Before
this was done, however, the images were convolved with a standard Gaussian
kernel, of the same scale as the spatial derivatives above. The effect of the scale
parameter σ has not been fully evaluated yet, but the experiments indicate that
σ should be chosen with respect to the magnitude of the motion. If the motion is
small, σ should also be chosen rather small. In the experiments on the orientation
part, a value of σ between 0.5 and 3 was used. In the position part, values higher
than 3 were usually used.
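
The derivative computation described above can be illustrated in one dimension. The sketch below is a simplification under our own assumptions (1-D signals, a kernel truncated at four standard deviations, hypothetical function names); the experiments themselves use 2-D images. A spatial derivative is obtained by convolution with the derivative of a Gaussian, and a temporal derivative by differencing two Gaussian-smoothed signals.

```python
import math

def gaussian_kernels(sigma, radius=None):
    """Sampled 1-D Gaussian (normalized to unit sum) and its derivative."""
    if radius is None:
        radius = int(math.ceil(4 * sigma))
    xs = range(-radius, radius + 1)
    g = [math.exp(-x * x / (2.0 * sigma * sigma)) for x in xs]
    total = sum(g)
    g = [w / total for w in g]
    dg = [-x / (sigma * sigma) * w for x, w in zip(xs, g)]
    return g, dg

def convolve_valid(signal, kernel):
    """Discrete convolution restricted to positions where the kernel fits entirely."""
    r = len(kernel) // 2
    return [sum(kernel[j] * signal[i + r - j] for j in range(len(kernel)))
            for i in range(r, len(signal) - r)]

sigma = 1.5
g, dg = gaussian_kernels(sigma)
first = [3.0 * i for i in range(40)]            # hypothetical intensity ramp, slope 3
second = [3.0 * (i - 0.5) for i in range(40)]   # same ramp moved half a pixel to the right
e_x = convolve_valid(first, dg)                 # spatial derivative, close to 3
e_t = [b - a for a, b in zip(convolve_valid(first, g),
                             convolve_valid(second, g))]   # temporal difference, -1.5
```

For this ramp the flow is u = 0.5 pixels per frame, and the 1-D analogue of the constraint (5), `e_x * u + e_t`, is close to zero at every valid position.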



Orientation We have used three translations at each calculation to get a linear
system with 3N equations and 9 unknowns. The use of two translations, as
described in Section 3.1, gives a system with N equations and 6 unknowns,
which seemed equally stable. The resulting direction vectors

$(D_x, D_y, D_z, \bar{D}_x, \bar{D}_y, \bar{D}_z, \bar{\bar{D}}_x, \bar{\bar{D}}_y, \bar{\bar{D}}_z)$

of some simulated translations are shown in Table 1. The result has been normalized
for each 3-vector.

As a reference, the directions were also calculated using the current method 
with exact spatial and temporal derivatives for each motion, i.e. by calculating 
the derivatives analytically and then using these to set up the linear system 
of equations of the form (11). These calculations gave perfect results, in full 
accordance with the theory. In Table 1 the variable t indicates the time and 
corresponds to the distance that the robot hand has been moved. The value of 
t is the length of the translation in robot hand coordinate system. An apparent 
motion in the image of the size of one pixel corresponds in these experiments 
to approximately t = 0.15. In the first column, t = 0 indicates that the exact
derivatives are used. 

The experiments show that the component of the translation in the Z direction
is the most difficult to obtain accurately. The value of this component is often
underestimated. Translations in the XY-plane, however, always gave near perfect
results. Translations along a single axis also performed very well, including
the Z-axis.



Table 1. Results of the motion estimation using the synthetic image data. The
parameter t indicates the length of the translation vector in the robot hand
coordinate system. Each row is one component of the direction vectors.

t = 0     t = 0.05, σ = 0.5   t = 0.1, σ = 0.5   (header illegible)   t = 0.3, σ = 1.5
0.5774    0.5826              0.5829             0.5846               0.5896
0.5774    0.5826              0.5885             0.6054               0.6289
0.5774    0.5667              0.5603             0.5401               0.5067
0.7071    0.7078              0.7078             0.7094               0.7094
0.7071    0.7064              0.7064             0.7045               0.7036
0         0.0012              0.0063             0.0211               0.0411
0         0.0019              0.0067             0.0236               0.0513
1         1.0000              1.0000             0.9996               0.9981
0         0.0014              0.0049             0.0157               0.0346


The method is naturally quite sensitive to noise since it is based fully on
approximations of the intensity derivatives. The resulting direction of some
simulated translations is shown in Table 2 together with the added amount of
Gaussian distributed noise. The parameter $\sigma_n$ corresponds to the variation of
the added noise. This should be put in relation to the intensity span in the image,
which for the synthetic images is -2 to 2. If the image is a standard gray-scale
image with 256 grey-levels, $\sigma_n = 0.02$ corresponds to approximately one
grey-level of added noise.



Table 2. Results of the motion estimation using images with added Gaussian noise.
The parameter σ_n indicates the variation of the added noise. The intensity span in the
image is from -2 to 2.

t = 0      t = 0.1      t = 0.1      t = 0.1      t = 0.1
σ_n = 0    σ_n = 0.01   σ_n = 0.02   σ_n = 0.03   σ_n = 0.05
0.5774     0.5899       0.5833       0.5928       0.5972
0.5774     0.5944       0.6016       0.6088       0.6132
0.5774     0.5465       0.5457       0.5272       0.5171
0.7071     0.7083       0.7030       0.7074       0.7025
0.7071     0.7058       0.7109       0.7066       0.7117
0          0.0145       0.0208       0.0180       0.0054
0          0.0132       0.0185       0.0145       0.0118
1          0.9998       0.9997       0.9999       0.9999
0          0.0117       0.0132       0.0094       0.0036

Position The algorithm for obtaining the translational part of the hand-eye
transformation was tested using the same kind of synthetic images as in the
preceding section. The experiments show that the algorithm is sensitive to the
magnitude of the angle of rotation and to the choice of the vector e.

The algorithm performs well when using equation (20) to obtain $T_x$ and $T_y$.
Using this equation with T = (1, 1, 1), e = (3, 2, 1) and the angle of rotation being
$\theta = \pi/240$, we get $T_x = 1.0011$ and $T_y = 1.0254$. With T = (7, 9, 5),
e = (3, 1, 0) and $\theta = \pi/240$, we get $T_x = 7.0805$ and $T_y = 8.9113$. In these
examples the scale of the Gaussian kernels was chosen as high as $\sigma = 6$.

The component $T_z$ is, however, more difficult to obtain accurately. Using the
choice (0, 0, 1) mentioned at the end of Section 3.2 and the same T, e and $\theta$ as in
the latter of the preceding examples, we get $T_y = 9.1768$ and $T_z = 5.2664$. This
represents an experiment that worked quite well. Choosing another e and $\theta$, the
result could turn out much worse. More experiments and a deeper analysis of
the method for obtaining the position are needed to understand the instabilities
of the algorithm.



4.2 Real Data 

To try the method on a real hand-eye system, we used a modified ABB IRB2003 
robot which is capable of moving in all 6 degrees of freedom. The camera was 
mounted by a ball head camera holder on the hand of the robot so that the 
orientation of the camera could be changed with respect to the direction of the 
robot hand coordinate system. 

The two scenes, A and B, that have been used in the experiments consist of
some objects placed on a wooden table. Scene A is similar to scene B, except for
some additional objects, see Figure 4.




Fig. 4. Two images from the real sequences A (left) and B (right) 



Three translations were used for each scene. In the first translation the robot
hand was moved along its Z-axis towards the table, in the second along the
X-axis and in the third along the Y-axis. The orientation of the camera was
approximately a rotation of $-\pi/4$ radians around the Z-axis with respect to the
robot hand coordinate system. Therefore, the approximate normed directions
of the translations as seen in the camera coordinate system were $D = (0, 0, 1)$,
$\bar{D} = \frac{1}{\sqrt{2}}(-1, 1, 0)$ and $\bar{\bar{D}} = \frac{1}{\sqrt{2}}(1, 1, 0)$.

The results of calculating the direction vectors in sequences A and B, using
the method in Section 3.1, are presented in Table 3. In these experiments the
derivatives were calculated using $\sigma = 2$. The motion of the camera was not
very precise and it is possible, for example, that the actual motion in sequence B
for the translation along the Z-axis also contained a small component along the
X-axis. More experiments on real image sequences are needed to fully evaluate
the method.



Table 3. The approximate actual motion compared with the calculated motion vectors
from sequence A and B respectively. The direction vectors are normed for each 3-vector.

Approx. actual motion   Calculated motion seq. A   Calculated motion seq. B
 0                       0.1415                     0.4068
 0                       0.0225                     0.1499
 1                       0.9897                     0.9011
-0.7071                 -0.5428                    -0.6880
 0.7071                  0.8345                     0.7184
 0                       0.0943                     0.1028
 0.7071                  0.7138                     0.6492
 0.7071                  0.7001                     0.7590
 0                       0.0183                     0.0495



5 Conclusions and Future Work 

We have in this paper proposed a method for hand-eye calibration using only
image derivatives, the so-called normal flow field. That is, we have only used
the gradients of the intensity in the images and the variation of intensity in
each pixel in an image sequence. Using known motions of the robot hand we
have been able to write down equations for the possible motion field in the
image plane in a few unknown parameters. By using the optical flow constraint
equation together with these motion equations we have written down linear
systems of equations which could be solved for the unknown parameters. The
motion equations were constructed so that the unknown parameters correspond
directly to the unknowns of the transformation H between the camera coordinate
system and the robot hand coordinate system.

The work was divided in two parts, one for the orientation, i.e. the rotation
of the camera with respect to the robot hand coordinate system, and one for the
position of the camera, i.e. the translation between the camera coordinate system
and the robot hand coordinate system. Some preliminary experiments have been
made on synthetic and real image data.

The theory shows that a hand-eye calibration using only image derivatives
should be possible, but the results of the experiments on real data are far from
the precision that we would like in a hand-eye calibration. This is, however, our
first study of these ideas and the next step will be to find ways to make the
motion estimation more stable in the same kind of setting. One thing to look
into is the use of multiple images in each direction so that, for example, a
more advanced approximation of the temporal derivatives can be used.
Abstract. This paper presents an original method for three-dimensional
elastic registration of multimodal images. We propose to make use of a
scheme that iterates between correcting for intensity differences between
images and performing standard monomodal registration. The core of
our contribution resides in providing a method that finds the transformation
that maps the intensities of one image to those of another. It
makes the assumption that there are at most two functional dependences
between the intensities of structures present in the images to register,
and relies on robust estimation techniques to evaluate these functions.
We provide results showing successful registration between several imaging
modalities involving segmentations, T1 magnetic resonance (MR),
T2 MR, proton density (PD) MR and computed tomography (CT).
Keywords: Multimodality, Elastic registration, Intensity correction, Robust
estimation, Medical imaging.



1 Introduction 

Over the last decade, automatic registration techniques of medical images of the 
head have been developed following two main trends: 1) registration of multimo- 
dal images using low degree transformations (rigid or affine), and 2) registration 
of monomodal images using high-dimensional volumetric maps (elastic or fluid 
deformations). The first category mainly addresses the fusion of complementary
information obtained from different imaging modalities. The second category’s 
predominant purpose is the evaluation of either the anatomical evolution pro- 
cess present in a particular subject or of anatomical variations between different 
subjects. 

These two trends have evolved separately mainly because the combined
problem of identifying complex intensity correspondences along with a
high-dimensional geometrical transformation defines a search space arduous to
traverse. Recently, three groups have imposed different constraints on the search
space, enabling them to develop automatic multimodal non-affine registration
techniques. All three methods make use of block matching techniques to
evaluate local translations. Two of them use mutual information (MI) [?] as a
similarity measure and the other employs the correlation ratio [?].

An important aspect when using MI as a registration measure is to compute
the conditional probabilities of one image's intensities with respect to those of
the other. To do so, Maintz et al. [?] proposed to use conditional probabilities
after rigid matching of the images as an estimate of the real conditional
probabilities after local transformations. Hence, the probabilities are evaluated only
once before fluid registration. However, Gaens et al. [?] argued that the assumption
that probabilities computed after affine registration are good approximations of
the same probabilities after fluid matching is unsuitable. They also proposed
a method in which local displacements are found so that the global MI
increases at each iteration, permitting incremental changes of the probabilities during
registration. Their method necessitates the computation of conditional
probabilities over the whole image for every voxel displacement. To avoid such
computations, which arise because MI requires many samples to
estimate probabilities, Lau et al. [?] have chosen a different similarity measure.
Due to the robustness of the correlation ratio with regard to sparse data [?],
they employed it to assess the similarity of neighbouring blocks. Hence no global
computation is required when moving subregions of the image.

Our method distinguishes itself by looking at the problem from a different
angle. In the last years, our group has had some success with monomodal image
registration using the demons method [?], an optical flow variant, when
dealing with monomodal volumetric images. If we were able to model the imaging
processes that created the images to register, and assuming these processes are
invertible, we could transform one of the images so that they are both
represented in the same modality. Then, we could use our monomodal registration
algorithm to register them. We have thus developed a completely automatic
method to transform the different structures' intensities in one image so that they
match the intensities of the corresponding structures in another image, and this
without resorting to any segmentation method.

The rationale behind our formulation is that there is a functional relationship
between the intensity of a majority of structures when imaged with different
modalities. This assumption is partly justified by the fact that the Woods
criterion [?] as well as the correlation ratio [?], which evaluate a functional
dependence between the intensities of the images to match, have been used with
success in the past, and sometimes lead to better results than MI, which assumes
a more general relation [?].

The idea of estimating an intensity transformation during registration is not
new in itself. For example, Feldmar et al. [?] as well as Barber [?] have both
published methods in which intensity corrections are proposed. These methods
restrict themselves to affine intensity corrections in a monomodal registration
context. We propose here a procedure based on one or two higher degree
polynomials found using a robust regression technique to enable the registration of
images from different modalities.

The remaining sections of this paper are organized in the following manner. 
First, we detail our multimodal elastic registration method. We then describe 
what kind of images were used to test the method and how they were acquired.
Next, results obtained by registering different images obtained from several
modalities are presented and discussed. We conclude this paper with a brief
discussion on future research tracks.



2 Method

Our registration algorithm is iterative and each iteration consists of two parts.
The first one transforms the intensities of anatomical structures of a source
image S so that they match the corresponding structures' intensities of a target
image T. The second part concerns the registration of S (after intensity
transformation) with T using an elastic registration algorithm.

In the following, we first describe the three-dimensional geometrical
transformation computation and then the intensity transformation computation. We
believe this ordering is more convenient since it is easier to see what result the
intensity transformation must provide once the geometrical transformation
procedure is clarified.

2.1 Geometrical Transformation

Many methods have been developed to deform one brain so its shape matches
that of another [?]. The one used in the present work is an adaptation of
the demons algorithm [?]. Adjustments were performed based on empirical
observations as well as on theoretical grounds which are discussed below. For
each voxel with position x in T, we hope to find the displacement v(x) so that
x matches its corresponding anatomical location in S. In our algorithm, the
displacements are computed using the following iterative scheme,

$$v_{n+1}(x) = G_\sigma \otimes \left( v_n + \frac{S \circ h_n(x) - T(x)}{\|(\nabla S \circ h_n)(x)\|^2 + [S \circ h_n(x) - T(x)]^2}\, (\nabla S \circ h_n)(x) \right), \tag{1}$$

where $G_\sigma$ is a Gaussian kernel, $\otimes$ denotes the three-dimensional convolution,
$\circ$ denotes the composition and the transformation h(x) is related to the
displacement by h(x) = x + v(x). As is common with registration methods, we
also make use of multilevel techniques to accelerate convergence. Details about
the number of levels and iterations as well as filter implementation issues are
addressed in Section 4. We here show how our method can be related to other
registration methods, notably the minimization of the sum of squared differences
(SSD) criterion, optical flow and the demons algorithm.

Relation with SSD Minimization In this framework, we find the transformation
h that minimizes the sum of squared differences between the transformed
source image and the target image. The SSD between the two images for a given
transformation h applied to the source is defined as

$$SSD(h) = \frac{1}{2} \sum_{x=1}^{N} [S \circ h(x) - T(x)]^2. \tag{2}$$
The minimization of Equation (2) may be performed using a gradient descent
algorithm. Thus, differentiating the above equation we get

$$\nabla SSD(h) = -[S \circ h(x) - T(x)]\,(\nabla S \circ h)(x).$$

The iterative scheme is then of the form

$$h_{n+1} = h_n + \alpha\,[S \circ h_n(x) - T(x)]\,(\nabla S \circ h_n)(x),$$

where $\alpha$ is the step length. This last equation implies

$$v_{n+1} = v_n + \alpha\,[S \circ h_n(x) - T(x)]\,(\nabla S \circ h_n)(x). \tag{3}$$

If we set $\alpha$ to a constant value, this method corresponds to a steepest gradient
descent. By comparing Equation (3) to Equation (1), one sees that our method
sets

$$\alpha = \frac{1}{\|(\nabla S \circ h_n)(x)\|^2 + [T(x) - S \circ h_n(x)]^2} \tag{4}$$

and applies a Gaussian filter to provide a smooth displacement field. Cachier
et al. [?] have shown that using Equation (4) closely relates Equation (1) with
a second order gradient descent of the SSD criterion, in which each iteration n
sets $h_{n+1}$ to the minimum of the SSD quadratic approximation at $h_n$. We refer
the reader to these articles for a more technical discussion on this subject as well
as for the formula corresponding to the true second order gradient descent.



Relation with Optical Flow T and S are considered as successive time samples
of an image sequence represented by I(x, t), where $x = (x_1, x_2, x_3)$ is a
voxel position in the image and t is time. The displacements are computed by
constraining the brightness of brain structures to be constant in time, so that
the following equality holds [?]:

$$\frac{\partial I}{\partial t} + v \cdot \nabla_x I = 0. \tag{5}$$

Equation (5) is however not sufficient to provide a unique displacement for each
voxel. By constraining the displacements to always lie in the direction of the
brightness gradient $\nabla_x I$, we get:

$$v(x) = -\frac{\partial I(x,t)/\partial t}{\|\nabla_x I(x,t)\|^2}\, \nabla_x I(x,t). \tag{6}$$

In general, the resulting displacement field does not have suitable smoothness
properties. Many regularization methods have been proposed to fill this
purpose [?]. One that can be computed very efficiently was proposed by Thirion [?]
in his description of the demons registration method using a complete grid of
demons. It consists of smoothing each dimension of the vector field with a
Gaussian filter. He also proposed to add $[\partial I(x,t)/\partial t]^2$ to the denominator of
Equation (6) for numerical stability when $\nabla_x I(x,t)$ is close to zero, a term which
serves the same purpose as $\alpha^2$ in the original optical flow formulation of Horn
and Schunck [?]. As is presented by Bro-Nielsen and Gramkow [?], this kind of
regularization approximates a linear elasticity transformation model.

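With this in mind, the displacement that maps a voxel position in T to its
position in S is found using an iterative method,

$$v_{n+1}(x) = G_\sigma \otimes \left( v_n - \frac{\partial I(x,t)/\partial t}{\|\nabla_x I(x,t)\|^2 + [\partial I(x,t)/\partial t]^2}\, \nabla_x I(x,t) \right). \tag{7}$$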



Spatial derivatives may be computed in several ways [?]. We have observed
from practical experience that our method performs best when they are
computed from the resampled source image of the current iteration. As shown
in Section 2.1, this is in agreement with the SSD minimization. Temporal
derivatives are obtained by subtracting the target image from the resampled source
image of the current iteration. These considerations relate Equation (7) to
Equation (1). The reader should note that the major difference between this method
and other optical flow strategies is that regularization is performed after the
calculation of the displacements in the gradient direction instead of using an
explicit regularization potential in a minimization framework.



Relation with the Demons Algorithm Our algorithm actually is a small
variation of the demons method [?] using a complete grid of demons, itself
closely related to optical flow as described in the previous section. The demons
algorithm finds the displacements using the following formula,

$$v_{n+1}(x) = G_\sigma \otimes \left( v_n + \frac{S \circ h_n(x) - T(x)}{\|\nabla T(x)\|^2 + [S \circ h_n(x) - T(x)]^2}\, \nabla T(x) \right).$$

As can be seen from the last equation, the only difference between our
formulation (Equation (1)) and the demons method is that derivatives are computed
on the resampled source image of the current iteration. This modification was
performed following the observations on the minimization of the SSD criterion.
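One iteration of the displacement update can be sketched as below. This is an illustration under assumptions rather than the authors' implementation: the resampled source image is taken as an input (the resampling step is not shown), images have at least two dimensions, derivatives are taken with `np.gradient`, smoothing uses a truncated Gaussian with zero padding, and the sign of the update follows the formula as printed above. A flag switches between the gradient of the resampled source (our formulation) and the gradient of the target (the demons method).

```python
import numpy as np

def smooth(field, sigma):
    """Gaussian-smooth an array along every axis (truncated, unit-sum kernel)."""
    radius = int(np.ceil(4 * sigma))
    x = np.arange(-radius, radius + 1, dtype=float)
    g = np.exp(-x**2 / (2 * sigma**2))
    g /= g.sum()
    out = field
    for axis in range(field.ndim):
        out = np.apply_along_axis(lambda v: np.convolve(v, g, mode="same"), axis, out)
    return out

def update_displacement(v, S_resampled, T, sigma, use_target_gradient=False):
    """One update of the displacement field v, of shape (ndim, *image_shape)."""
    diff = S_resampled - T
    # Gradient of the resampled source (default) or of the target (demons variant).
    grad = np.stack(np.gradient(T if use_target_gradient else S_resampled))
    denom = (grad**2).sum(axis=0) + diff**2
    safe = np.where(denom > 0, denom, 1.0)            # avoid division by zero
    step = np.where(denom > 0, diff / safe, 0.0) * grad
    # Regularize by smoothing each component of the vector field.
    return np.stack([smooth(component, sigma) for component in v + step])
```

Because the squared intensity difference appears in the denominator, the unsmoothed step is bounded wherever the gradient vanishes, which is the numerical-stability role described in the optical flow discussion.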



2.2 Intensity Transformation 

Prior to each iteration of the geometrical transformation, an intensity
correction is performed on S so that the intensities of its structures match those
in T. The displacement field is then updated by replacing S with its intensity
corrected version in Equation (1).

The intensity correction process starts by defining the set C of intensity
couples from corresponding voxels of T and of the current resampled source
image $S \circ h$, which will be designated by S in this section for simplicity. Hence,
the set C is defined as

$$C = \{(S(i), T(i));\ 1 \le i \le N\},$$

where N is the number of voxels in the images. S(i) and T(i) correspond to
the intensity value of the i-th voxel of S and T respectively when adopting the
customary convention of considering images as one-dimensional arrays. From
there, we show how to perform intensity correction if one or two functional
dependences can be assumed between the structures' intensities.

Monofunctional Dependence Assumption Our goal is to model the
transformation that characterizes the mapping from voxel intensities in S to those
in T, knowing that some elements of C are erroneous, i.e. that they would not be
present in C if S and T were perfectly matched. If we can assume a
monofunctional dependence of the intensities of T with regard to those of S as well as
additive stationary Gaussian white noise $\eta$ on the intensity values of T, then we
can adopt the model

$$T(i) = f(S(i)) + \eta(i), \tag{8}$$

where f is an unknown function to be estimated. This is exactly the model
employed in [?], which leads to the correlation ratio as the measure to be
maximized for registration. In that approach, for a given transformation, one
seeks the function that best describes T in terms of S. It is shown that, in a
maximum likelihood context, the intensity function $\hat{f}$ that best approximates f
is a least squares (LS) fit of T in terms of S.

Here the major difference is that we seek a high-dimensional geometrical
transformation. As opposed to affine registration where the transformation is
governed by the majority of good matches, we have seen in Section 2.1 that
using the elastic registration model, displacements are found using mainly local
information (i.e. gradients, local averages, etc.). Hence, we cannot expect good
displacements in one structure to correct for bad ones in another; we have to
make certain each voxel is moved properly during each iteration. For this, since
the geometrical transformation is found using intensity similarity, the most
precise intensity transformation is required. Consequently, instead of performing a
standard least squares regression, we have opted for a robust linear regression
estimator which will remove outlying elements of C during the estimation of the
intensity transformation. To estimate f we use the least trimmed squares (LTS)
method followed by a binary reweighted least squares (RLS) estimation [25]. The
combination of these two methods provides a very robust regression technique
with outlier detection, while ensuring that a maximum of pertinent points are
used for the final estimation.

Least Trimmed Squares Computation For our particular problem, we will
constrain the unknown function f to be a polynomial function with degree p:

$$f(s) = \theta_0 + \theta_1 s + \theta_2 s^2 + \cdots + \theta_p s^p,$$

where we need to estimate the polynomial coefficients $\theta = [\theta_0, \ldots, \theta_p]$. A
regression estimator will provide a $\hat\theta = [\hat\theta_0, \ldots, \hat\theta_p]$ which can be used to predict the
value of T(i) from S(i), $\hat{T}(i) = \hat\theta_0 + \hat\theta_1 S(i) + \hat\theta_2 S(i)^2 + \cdots + \hat\theta_p S(i)^p$, as well
as the residual errors $r(i) = T(i) - \hat{T}(i)$.

Multimodal Elastic Matching of Brain Images 517

A popular method to obtain $\hat\theta$ is to minimize the sum of squared residual errors,

$$\hat\theta = \arg\min_\theta \sum_{i=1}^{N} r(i)^2,$$

which leads to the standard LS solution. It is found by solving a linear system
using the Singular Value Decomposition (SVD) method. This method is known
to be very sensitive to outliers and thus is expected to provide a poor estimate of
the monofunctional mapping from S to T. The LTS method solves this problem
by minimizing the same sum on a subset of all residual errors, thus rejecting
large ones corresponding to outliers,

$$\hat\theta = \arg\min_\theta \sum_{i=1}^{c} \rho(i),$$

where $\rho(i)$ is the i-th smallest value of the set $\{r(1)^2, \ldots, r(N)^2\}$. This corresponds
to a standard LS on the c values that best approximate the function we are
looking for. Essentially, c/N represents the percentage of "good" points in C
and must be at least 50%. A lesser value would allow the estimation of parameters
that model a minority of points, which could then all be outliers. The value of c
will vary according to the modalities used during registration. Assigning actual
values to c is postponed to Section 4.

Our method for LTS minimization is a simple iterative technique. First, we
randomly pick c points from C. We then iterate between calculating $\hat\theta$ using
the standard LS technique on the selected points and choosing the c closest
points from C. This process is carried out until convergence, usually requiring fewer
than 5 iterations, and is guaranteed to find at least a local minimum of the LTS
criterion [?].
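The iterative LTS technique described above can be sketched as follows. This is a minimal sketch, assuming NumPy; the function name, the iteration cap and the convergence test (an unchanged subset) are choices made here for illustration. The LS step uses `np.linalg.lstsq`, which is SVD based.

```python
import numpy as np

def lts_polyfit(s, t, degree, c, n_iter=20, rng=None):
    """Least trimmed squares polynomial fit: alternate LS fits and re-selection.

    s, t   : 1-D arrays of source and target intensities (the couples in C)
    c      : number of points kept (at least half of len(s))
    Returns the coefficients [theta_0, ..., theta_p] and the LTS criterion value.
    """
    rng = np.random.default_rng(rng)
    A = np.vander(s, degree + 1, increasing=True)    # columns 1, s, s^2, ...
    subset = rng.choice(len(s), size=c, replace=False)
    for _ in range(n_iter):
        # Standard LS fit on the currently selected points.
        theta, *_ = np.linalg.lstsq(A[subset], t[subset], rcond=None)
        r2 = (t - A @ theta) ** 2
        # Keep the c points with the smallest squared residuals.
        new_subset = np.argsort(r2)[:c]
        if set(new_subset) == set(subset):
            break
        subset = new_subset
    return theta, np.sort(r2)[:c].sum()
```

Since the random start only guarantees a local minimum, running the fit from several starting subsets and keeping the smallest criterion value is a common precaution.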



Reweighted Least Squares Computation As discussed in [25], the LTS method is
very robust, but it tends to provide an estimate $\hat\theta$ that is notably less accurate
than the one we would obtain with a standard LS in the absence of outliers. The
solution may be refined by considering all the points that relate well to the LTS
estimate, not only the best c/N × 100%. An efficient technique to achieve this is
the so-called RLS regression, which minimizes the sum of squared residuals
over all the points that are not "too far" from the LTS estimate,

$$\hat\theta = \arg\min_\theta \sum_{i=1}^{N} w_i\, r(i)^2, \quad \text{where } w_i = \begin{cases} 1 & \text{if } r(i) < 3\hat\sigma, \\ 0 & \text{otherwise,} \end{cases}$$

where $\hat\sigma$ is a scale parameter which actually estimates the standard deviation
of the Gaussian noise $\eta$ introduced in Equation (8). Such an estimate can be
computed directly from the final value of the LTS criterion,

$$\hat\sigma = K \sqrt{\frac{1}{c} \sum_{i=1}^{c} \rho(i)}, \quad \text{with } K = \left( \frac{N}{c} \int_{-a}^{a} x^2 g(x)\, dx \right)^{-1/2}, \tag{9}$$

where g(x) is the Gaussian distribution N(0, 1) and a is the (0.5 + c/2N)-th
quantile of g(x). In Equation (9), K is a normalization factor introduced because
the LTS criterion is not a consistent estimator of $\sigma$ when the r(i) are distributed
like $N(0, \sigma^2)$, except when c = N.
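The scale estimate and the binary weights can be sketched with the standard library alone. The function names are hypothetical, and two points are assumptions made here: the integral of x^2 g(x) over [-a, a] is evaluated in closed form as c/N - 2 a g(a), and the weights compare the absolute residual with three times the scale.

```python
import math
from statistics import NormalDist

def lts_scale(squared_residuals, c):
    """Consistent scale estimate from the c smallest squared residuals."""
    n = len(squared_residuals)
    rho = sorted(squared_residuals)[:c]
    if c == n:
        k = 1.0                                   # no trimming: no correction needed
    else:
        a = NormalDist().inv_cdf(0.5 + c / (2 * n))
        g_a = math.exp(-a * a / 2) / math.sqrt(2 * math.pi)
        # Closed form of the integral of x^2 g(x) over [-a, a].
        integral = c / n - 2 * a * g_a
        k = 1.0 / math.sqrt(n / c * integral)
    return k * math.sqrt(sum(rho) / c)

def rls_weights(residuals, sigma_hat):
    """Binary weights: 1 for residuals within three scale units, 0 otherwise."""
    return [1 if abs(r) < 3 * sigma_hat else 0 for r in residuals]
```

The final RLS estimate is then an ordinary LS fit restricted to the points whose weight is 1.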

Bifunctional Dependence Assumption Functional dependence as expressed
in Equation (8) implicitly assumes that two structures having similar intensity
ranges in S should also have similar intensity ranges in T. With some
combinations of multimodal images, this is a crude approximation. For example,
ventricles and bones generally give similar response values in an MR T1 weighted
image while they appear with very distinct values in a CT scan. Conversely,
white and black matter are well contrasted in a T1 image while corresponding
to similar intensities in a CT.

To circumvent this difficulty, we have developed a strategy that enables the 
mapping of an intensity value in S to not only one, but two possible intensity 
values in T. This method is a natural extension of the previous method. Instead 
of computing a single function that maps the intensities of S to those of T, two 
functions are estimated and the mapping becomes a weighted sum of these two 
functions. 

We start with the assumption that if a point has an intensity s in S, the
corresponding point in T has an intensity t that is normally distributed around
two possible values depending on s, $f_\theta(s)$ and $f_\psi(s)$. In statistical terms, this
means that, given s, t is drawn from a mixture of Gaussian distributions,

$$P(t|s) = \pi_1(s)\, N(f_\theta(s), \sigma^2) + \pi_2(s)\, N(f_\psi(s), \sigma^2), \tag{10}$$

where $\pi_1(s)$ and $\pi_2(s) = 1 - \pi_1(s)$ are mixing proportions that depend on the
intensity in the source image, and $\sigma^2$ represents the variance of the noise in the
target image. Consistently with the functional case, we will restrict ourselves to
polynomial intensity functions, i.e. $f_\theta(s) = \theta_0 + \theta_1 s + \theta_2 s^2 + \cdots + \theta_p s^p$, and
$f_\psi(s) = \psi_0 + \psi_1 s + \psi_2 s^2 + \cdots + \psi_p s^p$.

An intuitive way to interpret this modelling is to state that for any voxel,
there is a binary "selector" variable $\epsilon = \{1, 2\}$ that would tell us, if it was
observed, which of the two functions $f_\theta$ or $f_\psi$ actually serves to map s to t.
Without knowledge of $\epsilon$, the best intensity correction to apply to S (in the sense
of the conditional expectation [?]) is seen to be a weighted sum of the two
functions,

$$f(s, t) = P(\epsilon = 1|s, t)\, f_\theta(s) + P(\epsilon = 2|s, t)\, f_\psi(s), \tag{11}$$

in which the weights correspond to the probability that the point be mapped
according to either the first or the second function. We see that the intensity
correction is now a function of both s and t. Applying Bayes' law, we find that
for $\epsilon = \{1, 2\}$:

$$P(\epsilon|s, t) = \frac{P(\epsilon|s)\, P(t|\epsilon, s)}{P(t|s)},$$

and thus, using the fact that $P(\epsilon|s) = \pi_\epsilon(s)$ and $P(t|\epsilon, s) = G_\sigma(t - f_\epsilon(s))$, the
weights are determined by

$$P(\epsilon|s, t) = \frac{\pi_\epsilon(s)\, G_\sigma(t - f_\epsilon(s))}{\pi_1(s)\, G_\sigma(t - f_\theta(s)) + \pi_2(s)\, G_\sigma(t - f_\psi(s))}, \tag{12}$$

where it should be clear from the context that $f_\epsilon \equiv f_\theta$ if $\epsilon = 1$, and $f_\epsilon \equiv f_\psi$ if
$\epsilon = 2$.

In order to estimate the parameters of the model, we employ an ad hoc
strategy that proceeds as follows. First, $\theta$ is estimated using the LTS/RLS method
described in Section 2.2. The points not used to compute $\theta$, in a number between
0 and N - c, are used to estimate $\psi$, still using the same method. Note that if this
number is less than 10 × p, p being the polynomial degree, functional dependence
is assumed and we fall back to the monofunctional assumption.

This provides a natural estimation of the "selector" variable for each voxel:
the $n_1$ points that were used to build $f_\theta$ are likely to correspond to $\epsilon = 1$,
while the $n_2$ points used to build $f_\psi$ are likely to correspond to $\epsilon = 2$. Finally,
the points that are rejected while estimating $\psi$ are considered as bad intensity
matches. A natural estimator for the variance $\sigma^2$ is then

$$\sigma^2 = \frac{n_1}{n_1 + n_2}\, \sigma_1^2 + \frac{n_2}{n_1 + n_2}\, \sigma_2^2,$$

where $\sigma_1^2$ and $\sigma_2^2$ are the variances found respectively for $f_\theta$ and $f_\psi$ during the
RLS regression (see Section 2.2). Similarly, the mixing proportions are computed
according to

$$\pi_\epsilon(s) = \frac{n_\epsilon(s)}{n_1(s) + n_2(s)}, \quad \epsilon = \{1, 2\},$$

in which $n_\epsilon(s)$ is the number of voxels having an intensity s and used to build
the function $f_\epsilon$. Notice that in the case where $n_1(s) = n_2(s) = 0$ (i.e. no voxel
corresponding to the intensity class s has been taken into account in the
computation of $f_\theta$ or $f_\psi$), then we arbitrarily set the mixing proportions to
$\pi_1(s) = \pi_2(s) = 0.5$.

The intensity correction of S can now be performed by reinjecting the estimated
parameters in Equations (11) and (12).
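Once the two polynomials, the mixing proportion and the noise variance are available, the correction of a single voxel can be sketched as below. The function names and argument layout are assumptions made for illustration; coefficients are ordered from the constant term upward, and the common Gaussian normalization factor is dropped because it cancels in the ratio.

```python
import math

def poly(coeffs, s):
    """Evaluate coeffs[0] + coeffs[1]*s + ... with Horner's rule."""
    acc = 0.0
    for c in reversed(coeffs):
        acc = acc * s + c
    return acc

def corrected_intensity(s, t, theta, psi, pi1, sigma):
    """Weighted sum of the two intensity functions for source s and target t."""
    f1, f2 = poly(theta, s), poly(psi, s)
    # Unnormalized Gaussian likelihoods of t around each candidate value.
    l1 = pi1 * math.exp(-(t - f1) ** 2 / (2 * sigma ** 2))
    l2 = (1.0 - pi1) * math.exp(-(t - f2) ** 2 / (2 * sigma ** 2))
    if l1 + l2 == 0.0:          # both likelihoods underflowed: fall back to the priors
        l1, l2 = pi1, 1.0 - pi1
    p1 = l1 / (l1 + l2)         # posterior probability of the first function
    return p1 * f1 + (1.0 - p1) * f2
```

When the target intensity lies close to one of the two candidate values, the posterior weight of that function approaches 1 and the correction reduces to the monofunctional case.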



3 Data 

Most of the data used in the following experiments were obtained from
BrainWeb [?]. This tool uses an atlas with a resolution of 1 × 1 × 1 mm³
comprising nine segmented regions from which T1, T2 and PD images can be generated.
Three images, one of each modality, were generated with the same resolution as
the atlas, 5% noise and no intensity non-uniformity. Since they are generated
from the same atlas, they represent the same underlying anatomy and are all
perfectly matched.

We also made use of a T1 MR image and a CT image, both from different
subjects and having a resolution of 1 × 1 × 1 mm³. Both these images were affinely
registered with the atlas using the correlation ratio method [?]. To differentiate
the T1 image obtained with the atlas from the other T1 image, the latter will
be referenced as SCH.

The images all respect the neurological convention, i.e. on coronal and axial 
slices, the patient’s left is on the left side of the image. 

4 Results and Discussion 

In the following section we present registration results involving images obtained 
from several different kinds of modalities. First, we show a typical example where 
monofunctional dependence can be assumed: registration of an atlas with an MR 
image. Then, more practical examples are shown where images from different 
modalities are registered and where bifunctional dependence may be assumed. 

The multilevel process was performed at three resolution levels, namely 4 mm,
2 mm and 1 mm per voxel. Displacement fields at one level are initialized from
the result of the previous level. The initial displacement field $v_0$ is set to zero.
The Gaussian filter $G_\sigma$ used to smooth the displacement field has a standard
deviation of 1 mm. 128 iterations are performed at 4 mm/voxel, 32 at 2 mm/voxel
and 8 at 1 mm/voxel. We believe that making use of a better stopping criterion,
such as the difference of the SSD values between iterations, would probably
improve the results shown below.
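As an illustration of the kind of stopping criterion suggested here, the following sketch stops when the relative decrease of the SSD between two successive iterations falls below a tolerance. The function names, the tolerance value and the flat-list image representation are assumptions for this example; the experiments reported in this paper use fixed iteration counts.

```python
def ssd(source, target):
    """Sum of squared differences between two equally sized intensity lists."""
    if len(source) != len(target):
        raise ValueError("images must have the same number of voxels")
    return sum((s - t) ** 2 for s, t in zip(source, target))


def should_stop(ssd_history, rel_tol=1e-3, max_iter=128):
    """Decide whether to stop iterating at the current resolution level.

    ssd_history -- SSD values recorded after each iteration so far
    rel_tol     -- hypothetical threshold on the relative SSD decrease
    max_iter    -- safety cap (128 is the count used at the coarsest level)
    """
    if len(ssd_history) >= max_iter:
        return True
    if len(ssd_history) < 2:
        return False
    previous, current = ssd_history[-2], ssd_history[-1]
    if previous == 0:
        return True  # already a perfect match
    # Also stops when the SSD increases (negative relative decrease).
    return (previous - current) / previous < rel_tol
```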



4.1 Monofunctional Dependence 

We present here the result of registering the atlas with SCH. Since the atlas
can be used to generate realistic MR images, it is safe to assume a functional
dependence from the intensity of the atlas to that of SCH. Also, since SCH
and the atlas are well aligned due to the affine registration, we have roughly
estimated that the number of points already well matched is at least 0.80 × N,
to which we have set the value of c. Since 10 classes are present in the atlas, the
polynomial degree was set to 9.

The result of registration is presented in Figure 1. For lack of space, we only
show one set of corresponding slices extracted from the 3D images. However,
we wish to make clear to the reader that the registration was performed in 3D,
not slice by slice. More illustrations will be found in [12]. From left to right,
the first picture shows an axial slice of the atlas. The second one presents the
corresponding slice of SCH (which was chosen as the target image). The third and
fourth pictures show the deformed atlas after elastic registration, respectively
without and with intensity correction.

Fig. 1. Axial slices of the atlas to SCH registration result. From left to right: Atlas;
SCH; atlas without intensity correction after registration with SCH; atlas with intensity
correction after registration with SCH.

As can be seen, large morphometric differences have been corrected. Still, the
matching is not perfect, which may be observed by comparing the shape of several
structures between SCH and the deformed atlas, e.g. the ventricles and the white
matter. Registration imperfections are reflected in the intensity corrected image
(right picture), where one may notice that the CSF intensity is slightly brighter
than that in SCH (as can be seen in the ventricles and around the cortex).
This problem can also be observed by looking at the intensity transformation
function presented in Figure 5. The intensity level corresponding to the CSF is
overestimated due to an overlap of the CSF in the atlas with the gray and white
matter in SCH, especially around the cortical area, which is known to present
large variations between subjects.

This is probably an inherent limitation of elastic models when used in the 
context of inter-subject registration. The strong smoothness constraints impo- 
sed by the Gaussian regularization (or related regularization techniques) may 
prevent the assessment of large and uneven displacements required to match the 
anatomical structures of different subjects. To allow for larger displacements, 
another regularization strategy should be used, such as that based on a fluid
model [?] or on a non-quadratic potential energy [13].



4.2 Bifunctional Dependence 

When registering images from different modalities, monofunctional dependence
may not necessarily be assumed. Here, we applied the method described in
Section 2.2, where two polynomial functions of degree 12 are estimated. This number
was set arbitrarily to a relatively high value to enable large intensity
transformations.

Figure 2 presents the result of registering T1 with CT. Using these last
two modalities, most intensities should be mapped to gray and only the skull,
representing a small portion of the image data, should be mapped to white. After
affine registration almost all voxels are well matched. Hence, in this particular
case, we have chosen a high value for c, set to 0.90 × N.

As we can see in Figure 2, the skull, shown in black in the MR image and
in white in the CT scan, is well registered and the intensity transformation
adequate. The top right graph of Figure 5 presents the functions $f_\theta$ and $f_\psi$
found during the registration process. The red line corresponds to $f_\theta$ and the
blue one to $f_\psi$. The line width for a given intensity $s$ is proportional to the value
of the corresponding $\pi_\epsilon(s)$. The gray values represent the joint histogram after
registration. As can be observed on this graph, the polynomials found fit well
with the high density clusters of the joint histogram. Still, some points need to
be addressed.




Fig. 2. Axial slices of the T1 to CT registration result. From left to right: T1; CT; T1
without intensity correction after registration with CT; T1 with intensity correction
after registration with CT.



We can observe that, due to the restricted polynomial degree, $f_\theta$ (shown in
red) oscillates around the CT gray value instead of fitting a straight line. This is
reflected in the intensity corrected image, where the underlying anatomy can
still be observed through small intensity variations inside the skull. This artifact has
negligible consequences during the registration process since the difference
between most of the voxel intensities is zero, resulting in null displacements.
The displacements driving the deformation will be those of the skull and the
skin contours, and will be propagated in the rest of the image as an effect of
smoothing the displacement field.

We also notice that $f_\psi$ (shown in blue), which is mainly responsible for the
mapping of the skull, does not properly model the cluster it represents for
intensities smaller than 5. The mapping for these intensities is slightly underestimated.
This may have two causes. First, as in the previous case, it might be due to the
restricted polynomial degree. Second, we can notice that some of the background
values in T1 that have an intensity close to 0 are mapped to gray values in the
CT which correspond to soft tissues. This means that some of the background
in the T1 is matched with the skin in the CT. This has the effect of “pulling”
$f_\psi$ closer to the small cluster positioned around (2, 65). If the underestimation
of $f_\psi$ arises because of the second reason, letting the algorithm iterate longer
might provide a better result.

In Figures 3 and 4 we present the result of registering T2 and PD respectively
with SCH. The bottom graphs of Figure 5 show the corresponding intensity
transformations. For these experiments, c was set to 0.60 × N, a value we have
found to be effective for these types of modalities after affine registration.

One observation that can be made by looking at the graphs of Figure 5 is that
the estimated functions $f_\theta$ and $f_\psi$ are quite similar in both cases. This suggests
that assuming a monofunctional dependence would be relevant. However, the
results we obtained when registering T2 with SCH, and PD with SCH, using the
monofunctional model were less convincing than when using the bifunctional
model [12].

Fig. 3. Coronal slices of the T2 to SCH registration result. From left to right: T2;
SCH; T2 without intensity correction after registration with SCH; T2 with intensity
correction after registration with SCH.

Fig. 4. Sagittal slices of the PD to SCH registration result. From left to right: PD;
SCH; PD without intensity correction after registration with SCH; PD with intensity
correction after registration with SCH.

This may be explained by a closer look at our bifunctional intensity modelling.
The bifunctional model equation reflects the assumption that if an anatomical
point has an intensity s in S, the corresponding point has an intensity t in T that is
distributed normally around two possible values depending on s. But it makes no
assumption about how the intensities in S are distributed. This models the
intensities of S without noise, which may not necessarily be well justified, but
enables the use of linear regression to estimate the intensity transformation.

The effect of noise in S is reflected in the joint histograms by enlarging clusters
along the x axis. This, added to bad matches and partial volume effect,
creates many outliers in C and makes the assessment of the true intensity
transformation more difficult, even with our robust regression technique.
Preprocessing of S using, for example, anisotropic diffusion may narrow the
clusters and provide better results [?].

Adding the estimation of a second function in the bifunctional model helps
counter the effect of noise on S. For example, the CSF in the PD image has
intensity values ranging from about 200 to 240 and gray matter from about 175
to 210. In SCH, these ranges are about 30 to 70 and 55 to 80 respectively. As can
be seen in Figure 5, $f_\theta$ models well the gray matter cluster but fails to reflect the
CSF transformation. Estimating the second polynomial $f_\psi$ solves this problem
by considering the CSF cluster.

Fig. 5. Graphs of the intensity corrections found in our experiments. From left to right,
top to bottom: Atlas to SCH, T1 to CT, T2 to SCH, PD to SCH. In the last three
graphs, which correspond to bifunctional models, the red (bright) line represents $f_\theta$
and the blue (dark) one $f_\psi$. The line width for a given intensity value $s$ in the source
image corresponds to the value of the corresponding proportion, $\pi_\epsilon(s)$. The gray values
represent the joint histogram after registration.

4.3 Displacement Field Comparison 

Since the atlas, the T1, the T2 and the PD images have all been registered with
SCH, it is relevant to compare some statistics of the resulting displacement fields
to assess if our algorithm provides consistent results across modalities.

We computed statistics regarding the difference between any two of these
displacement fields. The lengths of the vectors of the resulting difference fields were
calculated. Each cell of Table 1 presents, for each combination of displacement
fields, the median length, the average length with the corresponding standard
deviation and the maximum length of the difference field.
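The statistics described above can be sketched as follows. This is only an illustration under stated assumptions: displacement fields are represented as lists of (dx, dy, dz) tuples in millimeters, the function name is hypothetical, and the population standard deviation is used because the text does not say which estimator was applied.

```python
import math
import statistics


def difference_field_stats(field_a, field_b):
    """Median, average, standard deviation and maximum of the vector lengths
    of the difference between two displacement fields.

    Each field is a list of (dx, dy, dz) tuples, one per voxel, in the same order.
    """
    if len(field_a) != len(field_b):
        raise ValueError("displacement fields must cover the same voxels")
    # The length of the difference vector a - b is the Euclidean distance
    # between the two displacement vectors.
    lengths = [math.dist(a, b) for a, b in zip(field_a, field_b)]
    return {
        "median": statistics.median(lengths),
        "average": statistics.fmean(lengths),
        "std_dev": statistics.pstdev(lengths),  # population estimator (assumed)
        "maximum": max(lengths),
    }
```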

The two largest average errors are 1.58 mm and 1.76 mm, and were found
when comparing the Atlas-SCH registration with T1-SCH and PD-SCH,
respectively. This may be explained by the intensity correction bias for the CSF that
would tend to attenuate displacements and produce larger errors, a problem
mentioned in Section 4.1. Aside from these, the average error length varies between
0.97 mm and 1.40 mm and the median error is between 0.85 mm and 1.32 mm.
These are values in the range of the image resolution of 1.0 mm. Note also that
all the standard deviations are below this value.

Table 1. Statistics regarding the displacements difference between each type of
registration. Each cell presents the median length, the average length with the
corresponding standard deviation and the maximum length. All measures are in millimeters.

Difference (mm)   Atlas-SCH   Atlas-SCH   Atlas-SCH   T1-SCH      T1-SCH      T2-SCH
                  vs T1-SCH   vs T2-SCH   vs PD-SCH   vs T2-SCH   vs PD-SCH   vs PD-SCH
median            1.46        1.13        1.67        1.00        1.01        1.32
average           1.58        1.23        1.76        1.18        1.16        1.40
std. dev.         0.84        0.63        0.79        0.78        0.71        0.68
maximum           6.99        5.14        7.10        7.17        8.08        6.86

Also, we observe that the results obtained when registering images from
different modalities (Atlas-SCH, T2-SCH, and PD-SCH) seem to be consistent with
the monomodal registration result (T1-SCH), in which no intensity correction
was performed. This suggests that the intensity correction may not cause a
noticeable degradation of the registration when compared to the monomodal case. We
point out, however, that these are global measures that are presented to provide
an idea of the differences between the displacement fields. They do not strictly
provide a validation of the method, but do show a certain coherence between
the different results we obtained.

5 Conclusion 

In this paper, we introduced an original method to perform non-rigid registra- 
tion of multimodal images. This iterative algorithm is composed of two sections: 
the geometrical transformation and the intensity transformation. We have rela- 
ted the geometrical transformation computation to several popular registration 
concepts: SSD, optical flow and the demons method. Two intensity transforma- 
tion models were described which assume either monofunctional or bifunctional 
dependence between the intensities of the images to match. Both of these models 
are built using robust estimators to enable precise and accurate transformation 
solutions. Results of registration were presented and showed that the algorithm 
performs well for several kinds of modalities including T1 MR, T2 MR, PD MR, 
CT and segmentations, and provides consistent results across modalities. 

A current limitation of the method is that it uses Gaussian filtering to re- 
gularize the displacement field. This technique was chosen for its computational 
efficiency rather than for its physical relevance. In the context of inter-subject 
registration, other regularization strategies need to be investigated to better 
account for morphological differences. 






References 

1. D. C. Barber. Registration of low resolution medical images. Physics in Medicine
and Biology, 37(7):1485-1498, 1992.

2. J. L. Barron, D. J. Fleet, and S. S. Beauchemin. Performance of optical flow
techniques. International Journal of Computer Vision, 12(1):43-77, January 1994.

3. Simulated brain database. http://www.bic.mni.mcgill.ca/brainweb/.

4. J. W. Brandt. Improved accuracy in gradient-based optical flow estimation.
International Journal of Computer Vision, 25(1):5-22, 1997.

5. M. Bro-Nielsen and C. Gramkow. Fast fluid registration of medical images. In
K. H. Hohne and R. Kikinis, editors, Proc. VBC'96, volume 1131 of Lecture Notes
in Computer Science, pages 267-276. Springer-Verlag, 1996.

6. P. Cachier, X. Pennec, and N. Ayache. Fast non rigid matching by gradient
descent: Study and improvements of the “demons” algorithm. Technical Report 3706,
INRIA, June 1999.

7. G. E. Christensen, R. D. Rabbitt, and M. I. Miller. Deformable templates using
large deformation kinematics. IEEE Transactions in Medical Imaging,
5(10):1435-1447, October 1996.

8. C. A. Cocosco, V. Kollokian, R. K.-S. Kwan, and A. C. Evans. Brainweb: Online
interface to a 3D MRI simulated brain database. Neuroimage, Proc. HBM'97,
5(4):S425, May 1997.

9. D. L. Collins, A. P. Zijdenbos, V. Kollokian, J. G. Sled, N. J. Kabani, C. J. Holmes,
and A. C. Evans. Design and construction of a realistic digital brain phantom.
IEEE Transactions in Medical Imaging, 17(3):463-468, June 1998.

10. J. Feldmar, J. Declerck, G. Malandain, and N. Ayache. Extension of the ICP
algorithm to non-rigid intensity-based registration of 3D volumes. Computer Vision
and Image Understanding, 66(2):193-206, May 1997.

11. T. Gaens, F. Maes, D. Vandermeulen, and P. Suetens. Non-rigid multimodal image
registration using mutual information. In W. M. Wells, A. Colchester, and S. Delp,
editors, Proc. MICCAI'98, volume 1496 of Lecture Notes in Computer Science,
pages 1099-1106. Springer-Verlag, 1998.

12. A. Guimond, A. Roche, N. Ayache, and J. Meunier. Multimodal Brain Warping
Using the Demons Algorithm and Adaptative Intensity Corrections. Technical
Report 3796, INRIA, November 1999.

13. P. Hellier, C. Barillot, E. Memin, and P. Perez. Medical image registration with
robust multigrid techniques. In Proc. MICCAI'99, volume 1679 of Lecture Notes
in Computer Science, pages 680-687, Cambridge, England, October 1999.

14. B. K. P. Horn and B. G. Schunck. Determining optical flow. Artificial Intelligence,
17:185-203, August 1981.

15. R. K.-S. Kwan, A. C. Evans, and G. B. Pike. An extensible MRI simulator for
post-processing evaluation. In K. H. Hohne and R. Kikinis, editors, Proc. VBC'96,
volume 1131 of LNCS, pages 135-140. Springer-Verlag, 1996.

16. Y. H. Lau, M. Braun, and B. F. Hutton. Non-rigid 3D image registration using
regionally constrained matching and the correlation ratio. In F. Pernus, S. Kovacic,
H. S. Stiehl, and M. A. Viergever, editors, Proc. WBIR'99, pages 137-148, 1999.

17. F. Maes, A. Collignon, D. Vandermeulen, G. Marchal, and P. Suetens.
Multimodality image registration by maximization of mutual information. IEEE
Transactions in Medical Imaging, 16(2):187-198, 1997.

18. J. B. A. Maintz, E. H. W. Meijering, and M. A. Viergever. General multimodal
elastic registration based on mutual information. In K. M. Hanson, editor, Medical
Imaging 1998: Image Processing (MI'98), volume 3338 of SPIE Proceedings, pages
144-154, Bellingham (WA), USA, April 1998.

19. A. Papoulis. Probability, Random Variables, and Stochastic Processes.
McGraw-Hill, Inc., third edition, 1991.

20. X. Pennec, P. Cachier, and N. Ayache. Understanding the “demon's algorithm”:
3D non-rigid registration by gradient descent. In Proc. MICCAI'99, volume 1679
of Lecture Notes in Computer Science, pages 597-605. Springer-Verlag, 1999.

21. A. Roche, G. Malandain, and N. Ayache. Unifying maximum likelihood
approaches in medical image registration. International Journal of Imaging Systems and
Technology: Special Issue on 3D Imaging, 2000. In press.

22. A. Roche, G. Malandain, N. Ayache, and S. Prima. Towards a better
comprehension of similarity measures used in medical image registration. In Proc.
MICCAI'99, volume 1679 of LNCS, pages 555-566. Springer-Verlag, September 1999.

23. A. Roche, G. Malandain, X. Pennec, and N. Ayache. The correlation ratio as a
new similarity measure for multimodal image registration. In Proc. MICCAI'98,
volume 1496 of LNCS, pages 1115-1124. Springer-Verlag, October 1998.

24. P. J. Rousseeuw and K. Van Driessen. Computing LTS Regression for Large Data
Sets. Technical report, Statistics Group, University of Antwerp, 1999.

25. Peter J. Rousseeuw and Annick M. Leroy. Robust Regression and Outlier Detection.
Wiley series in probability and mathematical statistics. John Wiley & Sons, 1987.

26. E. P. Simoncelli. Design of multi-dimensional derivative filters. In International
Conference on Image Processing, Austin, USA, November 1994. IEEE.

27. J.-P. Thirion. Fast non-rigid matching of 3D medical images. Technical Report
2547, INRIA, Sophia-Antipolis, 1995.

28. J.-P. Thirion. Image matching as a diffusion process: an analogy with Maxwell's
demons. Medical Image Analysis, 2(3):243-260, 1998.

29. Arthur W. Toga. Brain Warping. Academic Press, 1999.

30. P. Viola and W. M. Wells. Alignment by maximization of mutual information.
International Journal of Computer Vision, 24(2):137-154, 1997.

31. R. P. Woods, J. C. Mazziotta, and S. R. Cherry. MRI-PET registration with
automated algorithm. Journal of Comp. Assist. Tomography, 17(4):536-546, 1993.




A Physically-Based Statistical Deformable 
Model for Brain Image Analysis 

Christophoros Nikou¹,², Fabrice Heitz¹, Jean-Paul Armspach², and Gloria Bueno¹,²

¹ Laboratoire des Sciences de l'Image, de l'Informatique et de la Télédétection
Université Strasbourg I (UPRES-A CNRS 7005)
4, boulevard Sébastien Brant, 67400 Illkirch, France
² Institut de Physique Biologique, Faculté de Médecine
Université Strasbourg I (UPRES-A CNRS 7004)
4, rue Kirschleger, 67085 Strasbourg CEDEX, France



Abstract. A probabilistic deformable model for the representation of
brain structures is described. The statistically learned deformable model
represents the relative location of head (skull and scalp) and brain surfaces
in Magnetic Resonance Images (MRIs) and accommodates their significant
variability across different individuals. The head and brain surfaces
of each volume are parameterized by the amplitudes of the vibration
modes of a deformable spherical mesh. For a given MRI in the training
set, a vector containing the largest vibration modes describing the head
and the brain is created. This random vector is statistically constrained
by retaining the most significant variation modes of its Karhunen-Loeve
expansion on the training population. By these means, the surfaces are
deformed jointly according to the anatomical variability observed in
the training set. Two applications of the probabilistic deformable model
are presented: the deformable model-based registration of 3D multimodal
(MR/SPECT) brain images without removing non-brain structures and
the segmentation of the brain in MRI using the probabilistic constraints
embedded in the deformable model. The multi-object deformable model
may be considered as a first step towards the development of a general
purpose probabilistic anatomical brain atlas.



1 Introduction 

In medical image analysis, deformable models offer a unique and powerful
approach to accommodate the significant variability of biological structures over
time and across different individuals. A survey on deformable models as a
promising computer-assisted medical image analysis technique has recently been
presented in [?].

We present a 3D statistical deformable model carrying information on multiple
anatomical structures (head, i.e. skull and scalp, and brain) for multimodal brain
image processing. Our goal is to describe the spatial relation between these
anatomical structures as well as the shape variations observed over a representative
population of individuals. In our approach, the different anatomical structures
are represented by physics-based deformable models [?] whose parameters
undergo statistical training. The resulting joint statistical deformable model is
considered as a first step towards the development of a general purpose
probabilistic atlas for various applications in medical image analysis (segmentation,
labeling, registration, pathology characterization).

In the proposed approach, the surfaces of the considered anatomical structures are
extracted from a training set of 3D MRIs. These surfaces are then parameterized
by the amplitudes of the vibration modes of a physically-based deformable model
[?] and a joint model is constructed for each set of structures. The joint
model is then statistically constrained by a Karhunen-Loeve decomposition of
the vibration modes. By these means, the spatial relation between the head and
brain structures, as well as the anatomical variability observed in the training
set, are compactly described by a limited number of parameters.

Physics-based models enable a hierarchical description of anatomical structures
as the ordered superimposition of vibrations (of different frequencies) of
an initial mesh. Physically-based parameterizations are also invariant to small
misregistration in rotation (contrary to Point Distribution Models (PDMs) [?],
which need accurate rotation and translation compensation). Note that
physically-based models also differ from 3D Fourier descriptors because the latter
also need a uniform way to discretize the surface and are not rotation invariant.
Two applications of the probabilistic deformable model are presented in this 
paper: 

— The segmentation of the brain from MRIs using the probabilistic constraints 
embedded in the deformable model. 

— The robust deformable model-based rigid registration of 3D multimodal 
(MR/SPECT) brain images by optimizing an energy function relying on 
the chamfer distance between the statistically constrained model parts and 
the image data. 

The remainder of this paper is organized as follows: in Section 2, the
parameterization of the head and brain structures by the vibration modes of a
spherical mesh is presented. The statistical training procedure is described in
Section 3. The applications of the probabilistic model to 3D segmentation and
to multimodal (MRI/SPECT) image registration are presented in Section 4.
Experimental results on real data, with a model trained on 50 patients, are presented
and commented on in the same section. Finally, conclusions are proposed in
Section 5.

2 3D Physics-Based Deformable Modeling 

To provide a training set, a representative collection of 50 3D MRI volumes
of different patients has first been registered to a reference image using an
unsupervised robust rigid registration technique [?]. This preliminary step is
necessary to provide a consistent initialization for the deformable model for all
images in the training step, since the representation is not invariant to rotation
(the same alignment is also applied to the patient data processed in Section 4).
The head of each volume has then been segmented by simple thresholding and
region growing [?].

Both head and brain contours were parameterized by the amplitudes of the
vibration modes of a physics-based deformable model. Following the approach of
Nastar et al. [?], the model for a given structure consists of 3D points sampled
on a spherical surface, following a quadrilateral cylinder topology in order to
avoid singularities due to the poles. Each node has a mass m and is connected
to its four neighbours with springs of stiffness k. The model nodes are stacked
in vector:

$$X_0 = (x_1, y_1, z_1, \ldots, x_{N'N}, y_{N'N}, z_{N'N})^T \qquad (1)$$

where N is the number of points in the direction of the geographical longitude
and N' is the number of points in the direction of the geographical latitude of the
sphere. The physical model is characterized by its mass matrix M, its stiffness
matrix K and its damping matrix C, and its governing equation may be written
as [13]:

$$M\ddot{U} + C\dot{U} + KU = F \qquad (2)$$

where U stands for the nodal displacements of the initial mesh $X_0$. The image
force vector F is based on the Euclidean distance between the mesh nodes and
their nearest contour points [2].

Since equation (2) is of order 3NN', where NN' is the total number of nodes
of the spherical mesh, it is solved in a subspace corresponding to the truncated
vibration modes of the deformable structure [?], using the following change
of basis:

$$U = \Phi\tilde{U} = \sum_i \tilde{u}_i \phi_i \qquad (3)$$

where $\Phi$ is a matrix and $\tilde{U}$ is a vector, $\phi_i$ is the i-th column of $\Phi$ and $\tilde{u}_i$ is the
i-th scalar component of vector $\tilde{U}$. By choosing $\Phi$ as the matrix whose columns
are the eigenvectors of the eigenproblem:

$$K\phi_i = \omega_i^2 M \phi_i, \qquad (4)$$

and using the standard Rayleigh hypothesis [?], matrices K, M and C are
simultaneously diagonalized:

$$\Phi^T M \Phi = I, \qquad \Phi^T K \Phi = \Omega^2 \qquad (5)$$

where $\Omega^2$ is the diagonal matrix whose elements are the eigenvalues $\omega_i^2$ and I is
the identity matrix.

An important advantage of this formulation is that the eigenvectors and
the eigenvalues of a quadrilateral mesh with cylinder topology have an explicit
expression [?] and they do not have to be computed by standard slow
eigendecomposition techniques (generally matrices K and M are very large). The
eigenvalues are given by the equation:

$$\omega^2_{p,p'} = \frac{4k}{m}\left(\sin^2\frac{p\pi}{2N} + \sin^2\frac{p'\pi}{N'}\right) \qquad (6)$$

and the eigenvectors are obtained by:

$$\phi_{p,p'} = \left[\ldots, \cos\frac{(2n-1)p\pi}{2N}\,\cos\frac{2n'p'\pi}{N'}, \ldots\right]^T \qquad (7)$$

with $n \in \{1, 2, \ldots, N\}$ and $n' \in \{1, 2, \ldots, N'\}$.

Substituting (3) into (2) and premultiplying by $\Phi^T$ yields:

$$\ddot{\tilde{U}} + \tilde{C}\dot{\tilde{U}} + \Omega^2\tilde{U} = \tilde{F} \qquad (8)$$

where $\tilde{C} = \Phi^T C \Phi$ and $\tilde{F} = \Phi^T F$.

In many computer vision applications [13], when the initial and the final state
are known, it is assumed that a constant load F is applied to the body. Thus,
equation (2) is called the equilibrium governing equation and corresponds to the
static problem:

$$KU = F \qquad (9)$$

In the new basis, equation (9) is thus simplified to 3NN' scalar equations:

$$\omega_i^2\,\tilde{u}_i = \tilde{f}_i \qquad (10)$$

In equation (10), $\omega_i^2$ designates the i-th eigenvalue and the scalar $\tilde{u}_i$ is the amplitude
of the corresponding vibration mode (corresponding to eigenvector $\phi_i$).
Equation (10) indicates that instead of computing the displacements vector U from
equation (9), we can compute its decomposition in terms of the vibration modes
of the original mesh.

The number of vibration modes retained in the object description is chosen
so as to obtain a compact but adequately accurate representation. A typical a
priori value covering many types of standard deformations is a quarter of the
number of degrees of freedom in the system [?] (i.e. 25% of the modes are kept).
Figure 1 shows the parameterization of head and brain surfaces considered for
a subject belonging to the training set, by the 25% lowest frequency modes.
Although not providing a high resolution description of the brain surface, this
truncated representation provides a satisfactory compromise between accuracy
and complexity of the representation. The spherical model is initialized around
the structures of interest (Fig. 1(a) and 1(d)). The vibration amplitudes are
explicitly computed by equation (10), where rigid body modes ($\omega_i = 0$) are
discarded, and the nodal displacements may be recovered using equation (3).
The physical representation $X(\tilde{U})$ is finally given by applying the deformations
to the initial spherical mesh (Fig. 1(b)-(c) and 1(e)-(f)):

$$X(\tilde{U}) = X_0 + \Phi\tilde{U} \qquad (11)$$

Thus, the head and brain surfaces of a particular patient are hierarchically
described in terms of vibrations of an initial spherical mesh. The next step
consists in applying the above parameterization to each patient of the training
set and performing statistical learning for the head and brain structures.
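The modal computation described in this section can be sketched in a few lines. This is an illustration only, not the authors' code: the function names are hypothetical, the eigenvalues are assumed to be sorted in increasing order, and the eigenvectors are passed in explicitly rather than generated from their closed-form expression.

```python
def modal_amplitudes(eigenvalues, modal_forces, keep_fraction=0.25, eps=1e-12):
    """Solve omega_i^2 * u_i = f_i for the lowest-frequency modes.

    eigenvalues   -- omega_i^2 values, sorted in increasing order (assumed)
    modal_forces  -- force components f_i expressed in the modal basis
    keep_fraction -- fraction of modes retained (a quarter, as in the text)
    Rigid-body modes (omega_i^2 close to 0) are discarded by giving them a zero
    amplitude; modes beyond the retained fraction are truncated.
    """
    n_keep = max(1, int(len(eigenvalues) * keep_fraction))
    amplitudes = []
    for w2, f in zip(eigenvalues[:n_keep], modal_forces[:n_keep]):
        amplitudes.append(0.0 if abs(w2) < eps else f / w2)
    return amplitudes


def reconstruct(x0, modes, amplitudes):
    """X = X0 + sum_i u_i * phi_i, with modes[i] the i-th eigenvector."""
    x = list(x0)
    for phi, u in zip(modes, amplitudes):
        for j, component in enumerate(phi):
            x[j] += u * component
    return x
```

Because the system is diagonal in the modal basis, each amplitude is obtained by a single division, which is what makes the truncated representation cheap to compute.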




Fig. 1. Head and brain parameterization from 3D MRI. The first column shows in a
multiplanar (sagittal, coronal, transversal) view the initial spherical mesh superimposed
on the structures to be parameterized. The middle column presents in a multiplanar view
the deformable models at equilibrium (25% of the modes). The last column illustrates
3D renderings of the physically-based models. The rows from top to bottom correspond
to: (a)-(c) head and (d)-(f) brain.



3 Statistical Training: The Joint Model 

For each image i = l,...,n (n = 50) in the training set, a vector a^containing 
the lowest frequency vibration modes, Mfj and Mb, describing the head and the 
brain, respectively, is created: 



tB\T 



a, = (Uf,Uf) 



(12) 
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where: 



tjf = ( 13 ) 

ijf = {ul,ul...,u\,J. (14) 

with 3{Mh + Mb) < 6NN'. 

Random vector a is statistically constrained by retaining the most significant 
variation modes in its Karhunen-Loeve (kl) transform [415 1 : 



$$a = \bar{a} + P b \qquad (15)$$

where

$$\bar{a} = \frac{1}{n} \sum_{i=1}^{n} a_i \qquad (16)$$



is the average vector of vibration amplitudes of the structures belonging to the 
training set, P is the matrix whose columns are the eigenvectors of the covariance 
matrix 

$$\Gamma = E[(a - \bar{a})(a - \bar{a})^T] \qquad (17)$$

and

$$b_i = P^T (a_i - \bar{a}) \qquad (18)$$

are the coordinates of $(a_i - \bar{a})$ in the eigenvector basis.
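
The KL construction of equations (15) to (18) can be illustrated on a toy data set. The following is only a minimal sketch, not the authors' implementation: it assumes a covariance normalized by $n$, extracts a single dominant mode by power iteration (a real model would keep the $m$ leading eigenvectors), and all function names and sample values are hypothetical.

```python
import math

def mean_vector(samples):
    """Average vector a_bar of equation (16)."""
    n = len(samples)
    return [sum(column) / n for column in zip(*samples)]

def covariance(samples, mean):
    """Sample covariance matrix of equation (17), normalized by n (assumption)."""
    n, d = len(samples), len(mean)
    cov = [[0.0] * d for _ in range(d)]
    for a in samples:
        centered = [x - mu for x, mu in zip(a, mean)]
        for r in range(d):
            for s in range(d):
                cov[r][s] += centered[r] * centered[s] / n
    return cov

def leading_mode(cov, iterations=200):
    """Dominant eigenpair by power iteration (one column of P)."""
    d = len(cov)
    v = [1.0 + 0.1 * k for k in range(d)]  # arbitrary non-zero start vector
    for _ in range(iterations):
        w = [sum(cov[r][s] * v[s] for s in range(d)) for r in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    cv = [sum(cov[r][s] * v[s] for s in range(d)) for r in range(d)]
    eigenvalue = sum(v[r] * cv[r] for r in range(d))
    return eigenvalue, v

# Hypothetical amplitude vectors a_i lying on a single line, so that
# one statistical mode explains all of the variation.
samples = [[2.0, 1.0], [4.0, 2.0], [6.0, 3.0], [8.0, 4.0]]
a_bar = mean_vector(samples)
gamma = covariance(samples, a_bar)
lam, p = leading_mode(gamma)

# Equation (18): coordinate of each centered sample on the retained mode.
b = [sum(pk * (x - mu) for pk, x, mu in zip(p, a, a_bar)) for a in samples]
# Equation (15): reconstruction a = a_bar + P b.
reconstructed = [[mu + pk * bi for mu, pk in zip(a_bar, p)] for bi in b]
```

In this toy case the single retained mode reconstructs every sample exactly, which is the behaviour the truncated KL expansion approximates when a few eigenvalues carry most of the variance.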

The deformable model is finally parameterized by the $m$ most significant
statistical deformation modes stacked in vector $b$. By modifying $b$, both head and
brain are deformed in conjunction (fig. 2), according to the anatomical
variability observed in the training set. The multi-object deformable model describes
the spatial relationships between the considered surfaces of a subject as well as
their shape variations.

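Given the double (head and brain) initial spherical mesh:

$$X_{INIT} = (X_0, X_0)^T \qquad (19)$$

the statistical deformable model $X(a)$ is thus represented by:

$$X(a) = X_{INIT} + \Phi a \qquad (20)$$

Combining equations (15) and (20) we have:

$$X(b) = X_{INIT} + \Phi \bar{a} + \Phi P b \qquad (21)$$

where:

$$\Phi = \begin{pmatrix} \Phi_H & 0 \\ 0 & \Phi_B \end{pmatrix}, \qquad P = \begin{pmatrix} P_{HB} \\ P_{BH} \end{pmatrix}, \qquad \bar{a} = \begin{pmatrix} \bar{a}_H \\ \bar{a}_B \end{pmatrix} \qquad (22)$$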
In equation (22), the columns of the $3NN' \times 3M_H$ matrix $\Phi_H$ are the eigenvectors
of the spherical mesh describing the head surface and the columns of the
$3NN' \times 3M_B$ matrix $\Phi_B$ are the eigenvectors of the spherical mesh describing
the brain surface. Besides, the $3M_H \times m$ matrix $P_{HB}$ and the $3M_B \times m$ matrix
$P_{BH}$ describe the statistical dependencies of head and brain deformations
observed in the training set. Vectors $\bar{a}_H$ and $\bar{a}_B$ are of order $3M_H \times 1$ and $3M_B \times 1$
respectively, and vector $b$ has a low dimension $m \ll 3(M_H + M_B)$.




a) $b[1] = -3\sqrt{\lambda_1}$    b) $b[1] = 0$    c) $b[1] = 3\sqrt{\lambda_1}$

d) $b[2] = -3\sqrt{\lambda_2}$    e) $b[2] = 0$    f) $b[2] = 3\sqrt{\lambda_2}$

Fig. 2. Multiplanar view of the 3D joint model's deformations by varying the first two
statistical modes in vector $b$ between $-3\sqrt{\lambda_i}$ and $+3\sqrt{\lambda_i}$, $i = 1, 2$. $\lambda_i$ designates the
$i$-th eigenvalue of the covariance matrix $\Gamma$.



As can be seen in Table 1, with the KL representation only a few parameters
are necessary to describe the variations in the training population (fig. 3). Table 1
shows that, for instance, 5 parameters carry approximately 95% of the global
information.

The number of degrees of freedom of the original mesh, for both head and
brain surfaces, was $2 \times 3NN' = 2 \times 3 \times 100 \times 100 = 60000$. In the vibration modes
subspace, this number was reduced to $3(M_H + M_B) = 3 \times (2500 + 2500) = 15000$
and finally in the KL subspace the degrees of freedom were reduced to $m \approx 5$,
achieving a compression ratio of 12000 : 1. This compression ratio enables a
compact description of shape variability, and results in a tractable constrained
deformable model for brain image segmentation and registration, as described
in the next section.
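
The successive reductions quoted above can be checked with a few lines of arithmetic. This is only a bookkeeping sketch using the figures stated in the text; the variable names are illustrative.

```python
# Mesh resolution and retained modes as quoted in the text.
N, N_prime = 100, 100      # nodes along each direction of one spherical mesh
M_H, M_B = 2500, 2500      # retained vibration modes for head and brain
m = 5                      # retained statistical (KL) modes

dof_mesh = 2 * 3 * N * N_prime        # two surfaces, three coordinates per node
dof_modal = 3 * (M_H + M_B)           # after truncating the vibration modes
compression = dof_mesh / m            # original mesh versus KL subspace
mode_fraction = M_H / (N * N_prime)   # share of modes kept per surface

print(dof_mesh, dof_modal, compression, mode_fraction)
```

The retained fraction of 0.25 agrees with the "25% of the modes" mentioned in the caption of Fig. 1.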
Table 1. Percentage of the global information carried by the different eigenvalues
associated with the statistical model. The total number of non-zero eigenvalues is 50.



KL decomposition of joint model variability 




4 Applications 

Several applications of the statistical model may be considered in brain image 
processing. The model may be used as a simplified anatomical representation 
of the images belonging to the training set. If the training set is representative 
enough of a population, the model may also be used to analyse images of patients 
not belonging to the training set. To this end, the 50 subjects of our data base 
were carefully selected, with the aid of an expert neurologist. Besides, the data 
base is conceived in such a way that it can be incrementally augmented by new 
elements. 

We consider here two applications of the joint statistical model: the segmenta- 
tion of the brain from 3D MRI and the registration of multimodal (MRI/SPECT) 
brain images. Before presenting these two applications, let us notice that the 
equation describing the configuration of the statistical model: 



$$X(b) = X_{INIT} + \Phi \bar{a} + \Phi P b \qquad (23)$$

may be separated into two equations describing the head and brain parts of the
model:

$$X_H(b) = X_0 + \Phi_H \bar{a}_H + \Phi_H P_{HB}\, b \qquad (24)$$

$$X_B(b) = X_0 + \Phi_B \bar{a}_B + \Phi_B P_{BH}\, b \qquad (25)$$

Let us also recall that equations (24) and (25) are coupled by the sub-matrices
$P_{HB}$ and $P_{BH}$ representing the statistical dependencies (spatial relationships)
between the two anatomical structures. These submatrices cannot be calculated
separately: they are parts of matrix $P$. The terms $\bar{a}_H$ and $\bar{a}_B$ express the mean
vibration amplitudes for the head and brain surfaces of the training set, $X_0$ is
the initial spherical mesh and $\Phi_H$ and $\Phi_B$ denote its eigenvectors.

4.1 Brain Segmentation

In order to segment the brain from a patient MRI volume not belonging to
the training set, the patient head is first parameterized by the physics-based
model. The head structure is easily segmented from its background by simple
thresholding and region growing algorithms. The segmented head surface is
parameterized by the amplitudes of the vibration modes of a spherical mesh, as
already explained in Section 2. The spherical mesh is initialized around the head
structure and the corresponding equation is solved in the modal subspace. The
solution for the vibration amplitudes describing the patient head surface is:

(26)

for $i = 1, \ldots, 3M_H$. The head surface coordinates are obtained by introducing
vector $U^H = (u_1, u_2, \ldots, u_{3M_H})^T$ in equation (11):

$$X_H(U^H) = X_0 + \Phi_H U^H \qquad (27)$$

The next step consists in determining the statistical model parameters $b$
describing “at best” the segmented head surface:

$$X_H(U^H) = X_0 + \Phi_H \bar{a}_H + \Phi_H P_{HB}\, b \qquad (28)$$

Equation (28) is an overconstrained system with $3NN'$ equations (the head surface
coordinates $X_H$) and $m$ unknowns (the components of $b$). Moreover, matrix
$P_{HB}$, describing the head and brain surfaces spatial relation, constrains vector $b$
to describe both head and brain surfaces. Further regularization may be obtained
by adding a strain-energy minimization constraint:

$$E_s = b^T \Lambda^2 b \qquad (29)$$

where $\Lambda = \mathrm{diag}\{\lambda_i\}$ contains the eigenvalues of the covariance matrix $\Gamma$. Strain
energy enforces a penalty proportional to the squared eigenvalue associated with
each component of $b$.

The solution of (28) is formulated in terms of minimization of a regularized
least squares error:

$$E(b) = [X_H(U^H) - X_0 - \Phi_H \bar{a}_H - \Phi_H P_{HB} b]^T [X_H(U^H) - X_0 - \Phi_H \bar{a}_H - \Phi_H P_{HB} b] + \alpha\, b^T \Lambda^2 b \qquad (30)$$

Differentiating with respect to $b$, we obtain the strain-minimizing overconstrained
least squares solution:

$$b^* = [P_{HB}^T \Phi_H^T \Phi_H P_{HB} + \alpha \Lambda^2]^{-1} P_{HB}^T \Phi_H^T [X_H(U^H) - X_0 - \Phi_H \bar{a}_H] \qquad (31)$$

A first estimate of the patient's brain surface is then recovered by introducing
the estimated parameter $b^*$ in equation (25) describing the brain part of the
statistical model:

$$X_B(b^*) = X_0 + \Phi_B \bar{a}_B + \Phi_B P_{BH}\, b^* \qquad (32)$$
Equation (32) provides a good initial prediction of the location of the brain
surface, obtained by exploiting the spatial relationships between head and brain,
coded in the learned statistical representation. This feature of the proposed
approach significantly alleviates the problem of manual initialization, which is a
requirement in most of the deformable model-based segmentation methods.
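
The estimation of $b^*$ amounts to a regularized linear least squares problem. The sketch below illustrates that step under stated assumptions: `A` stands for the product $\Phi_H P_{HB}$, `r` for the residual $X_H(U^H) - X_0 - \Phi_H \bar{a}_H$, the penalty is taken as $\alpha\, b^T \Lambda^2 b$, and the small dense solver, the function names and the numerical values are all hypothetical. It is not the authors' code.

```python
def transpose(matrix):
    return [list(row) for row in zip(*matrix)]

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def matvec(a, x):
    return [sum(aij * xj for aij, xj in zip(row, x)) for row in a]

def solve(a, y):
    """Solve a x = y by Gaussian elimination with partial pivoting."""
    n = len(y)
    aug = [row[:] + [y[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x

def estimate_b(A, r, eigenvalues, alpha):
    """Minimize |r - A b|^2 + alpha * b^T Lambda^2 b (assumed penalty form)."""
    At = transpose(A)
    normal = matmul(At, A)
    for k, lam in enumerate(eigenvalues):
        normal[k][k] += alpha * lam ** 2
    return solve(normal, matvec(At, r))

# Hypothetical overconstrained system: 3 equations, m = 2 unknowns.
A = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
r = [2.0, -1.0, 1.0]                      # consistent with b = (2, -1)
b_plain = estimate_b(A, r, [1.0, 1.0], alpha=0.0)
b_regularized = estimate_b(A, r, [1.0, 1.0], alpha=1.0)
```

With `alpha = 0` the exact parameters are recovered; a positive `alpha` shrinks the estimate towards the mean shape, which is the intended effect of the strain-energy term.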

Further improvement of this initial solution may be obtained by alternately
optimizing an energy function parameterized by the $m$ components of vector $b$
[5], in order to fit the part of the model describing the brain, $X_B$, to a noisy
contour map Ic extracted from the MRI image 0. In our case, the cost function 
E to be optimized is defined as: 

$$E(b) = \sum_{p \in X_B(b),\; p = 1}^{3NN'} \nabla_G * I_c(p) \qquad (33)$$

where the operator $\nabla_G$ denotes the gradient of a Gaussian kernel. The above
cost function simply counts the number of points of the model located on a
contour point of the smoothed brain image. Optimization of energy function
(33) is obtained by a non linear Gauss-Seidel like algorithm, known as IGM [5].
It has fast convergence properties and only accepts configurations decreasing the
cost function.
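
To make the idea of a Gauss-Seidel like, descent-only optimization concrete, here is a generic coordinate-wise search over the components of $b$. It is an illustrative stand-in, not the IGM algorithm itself, whose details are not given in this section; the function name, step schedule and toy cost are assumptions.

```python
def coordinate_descent(cost, b0, step=1.0, min_step=1e-3):
    """Update one component at a time, accepting only moves that lower the cost."""
    b = list(b0)
    best = cost(b)
    while step >= min_step:
        improved = False
        for k in range(len(b)):
            for delta in (step, -step):
                trial = b[:]
                trial[k] += delta
                value = cost(trial)
                if value < best:          # reject any configuration that is not better
                    b, best, improved = trial, value, True
                    break
        if not improved:
            step /= 2.0                   # refine the search once no move helps
    return b, best

# Hypothetical smooth cost with its minimum at b = (1, -2).
def toy_cost(b):
    return (b[0] - 1.0) ** 2 + (b[1] + 2.0) ** 2

b_opt, e_opt = coordinate_descent(toy_cost, [0.0, 0.0])
```

Because every accepted move strictly decreases the cost, the procedure is deterministic and converges to a local minimum, which is why a good initial prediction of $b$ matters.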

To summarize, the overall segmentation algorithm is based on the following 
steps: 

1. Parameterization of the head surface using equations (26) and (27).

2. Estimation of the statistical deformation parameters $b^*$ by solving the
regularized overconstrained system (31).

3. Prediction of the brain surface by equation (32).

4. Fine-tuning of the solution by deterministic optimization of cost function
(33).

Figure 3 presents a typical example of brain segmentation from a 3D MRI,
corresponding to a patient not belonging to the training set. The image in figure
3(a) is a post-operative MRI (thus exhibiting missing data). In figure 3(b) the
head surface is segmented and parameterized by the physics-based deformable
model (eq. (26) and (27)). In fig. 3(b), the head surface coordinates combined
with the probabilistic model provide a good prediction for the brain surface.
The statistical model is not affected by missing data because its deformations
are constrained by the statistical analysis of the shape variations observed in the
training population. The whole segmentation process takes about 5 min cpu time
on a standard (HP 9000/C200) workstation for a $128^3$ image volume. Most of
the computation time concerns head surface parameterization and especially the
image forces based on the euclidean distance transform of the 3D MR image E|. 

4.2 Multimodal Image Registration 

The second application considered in this paper concerns the rigid registration
of multimodal (MR/SPECT) 3D images. Registration of a multimodal image
pair consists in estimating the rigid transformation parameters (3D rotation and
translation parameters) that have to be applied to the image to be registered
(here the SPECT image) in order to match the reference image (here the MRI).

Fig. 3. (a) A patient MR image with the initial spherical mesh superimposed. (b)
Prediction of the brain surface using the head surface and the probabilistic deformable
model. (c) 3D rendering of the segmented brain.

The registration relies on the head structure in the MRI and the brain struc- 
ture in the SPECT image, which are easy to extract from these two modalities 
(contrary to the brain structure in MRI). These structures do not overlap but 
the deformable model represents the relative location of the head and brain con- 
tours and accounts for the anatomical variability observed among the training 
population. The deformable model (restricted here to head and brain surfaces) 
is used as a probabilistic atlas that constrains the rigid registration of the image 
pair. 

The multimodal rigid registration method relies on the following steps: 

1. Segmentation of the head structure in MRI and the brain structure in 
SPECT from their backgrounds. 

2. Brain surface recovery from the MRI using the segmentation algorithm
presented in section 4.1.

3. Registration of the estimated brain surface with the SPECT brain surface 
by optimization of a cost function. 

The first step is standard preprocessing for background noise elimination.
The second step estimates the brain surface from the MRI using the head
surface parameterization and the statistical deformable model. By these means,
multimodal image registration is also a measure for the accuracy of the
segmentation process. Finally, the third step brings into alignment the estimated MRI
brain surface and the SPECT image surface by optimization of an objective
function having as variables the rigid transformation parameters between the two
surfaces. Various cost functions may be used in that step for the registration of
binary surfaces. We have applied the following energy function:

$$E(\Theta) = \sum_{p \in I_{SPECT}} I_D(T_\Theta(p)) \qquad (34)$$

where $T_\Theta$ is the rigid transformation with parameters $\Theta = \{t_x, t_y, t_z, \theta_x, \theta_y, \theta_z\}$,
$p$ is a voxel of the SPECT image surface $I_{SPECT}$ and $I_D$ is the chamfer distance
transformation [3] of the part of the statistical model describing the brain. For
all of the SPECT surface voxels, equation (34) accumulates the distance between
a SPECT image surface point and its nearest point on the deformable model
surface. We have chosen chamfer distance matching because it is fast and it is
easily generalized to any surfaces. The whole registration procedure takes about
10 min cpu time on a HP C200 workstation for a $128^3$ image volume.
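
The principle of chamfer matching can be sketched in a reduced setting. The example below is a 2D, translation-only illustration and not the authors' 3D implementation: it assumes the classical two-pass 3-4 chamfer mask, ignores rotations, and uses hypothetical function names, an arbitrary out-of-image penalty and toy data.

```python
def chamfer_distance(mask):
    """Two-pass 3-4 chamfer distance map of a binary 2D mask (list of rows)."""
    INF = 10 ** 9
    h, w = len(mask), len(mask[0])
    d = [[0 if mask[y][x] else INF for x in range(w)] for y in range(h)]
    for y in range(h):                       # forward pass (top-left to bottom-right)
        for x in range(w):
            if y > 0:
                d[y][x] = min(d[y][x], d[y - 1][x] + 3)
                if x > 0:
                    d[y][x] = min(d[y][x], d[y - 1][x - 1] + 4)
                if x < w - 1:
                    d[y][x] = min(d[y][x], d[y - 1][x + 1] + 4)
            if x > 0:
                d[y][x] = min(d[y][x], d[y][x - 1] + 3)
    for y in range(h - 1, -1, -1):           # backward pass (bottom-right to top-left)
        for x in range(w - 1, -1, -1):
            if y < h - 1:
                d[y][x] = min(d[y][x], d[y + 1][x] + 3)
                if x > 0:
                    d[y][x] = min(d[y][x], d[y + 1][x - 1] + 4)
                if x < w - 1:
                    d[y][x] = min(d[y][x], d[y + 1][x + 1] + 4)
            if x < w - 1:
                d[y][x] = min(d[y][x], d[y][x + 1] + 3)
    return d

def registration_cost(dist, points, translation, outside_penalty=100):
    """Sum of distance-map values at the translated surface points, as in (34)."""
    h, w = len(dist), len(dist[0])
    tx, ty = translation
    total = 0
    for x, y in points:
        xx, yy = x + tx, y + ty
        total += dist[yy][xx] if 0 <= xx < w and 0 <= yy < h else outside_penalty
    return total

# Hypothetical model surface: a single point at (x, y) = (2, 2) in a 5 x 5 image.
model = [[1 if (x, y) == (2, 2) else 0 for x in range(5)] for y in range(5)]
dist = chamfer_distance(model)
surface_points = [(0, 0)]                    # surface to be registered
candidates = [(tx, ty) for tx in range(-2, 3) for ty in range(-2, 3)]
best = min(candidates, key=lambda t: registration_cost(dist, surface_points, t))
```

The distance map is computed once, so each evaluation of the cost only requires a table lookup per surface point, which is what makes this criterion fast.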

Figure 4 shows an example of MRI/SPECT registration using the proposed
technique. The images in figure 4(a) show the two volumes before registration.
The SPECT contours are superimposed onto the MRI to qualitatively evaluate
the registration. Figure 4(b) presents the head and brain surface recovery of the
MRI using the segmentation algorithm described in the previous section. The
matching of the SPECT volume to the part of the model describing the brain
is illustrated in fig. 4(c). The images in figure 4(d) show the two volumes after
registration. As can be seen, although the MRI and SPECT head and brain
contours do not overlap, the two images have been correctly registered using the
statistical model.

To quantitatively assess the ability of the physics-based statistical deformable 
model to handle multimodal image pairs, a 3D SPECT image volume has been 
manually registered to its corresponding MRI volume with the aid of an expert 
physician. The manually registered SPECT volume was then transformed using 
translations between -20 and +20 voxels and rotations between -30 and +30
degrees. By these means 25 new images were created. These images were then 
registered using three different techniques and statistics on the registration errors 
were computed on the set of 25 different registrations. We have compared our 
Statistical Deformable Model-based technique (SDM) to the maximization of the 
Mutual Information (MI) ^ (currently considered as a reference method) and 
the Robust Inter-image Uniformity criterion (RIU) developed by the authors [ED 
[□]. Both of the latter techniques have been validated in previous studies and 
are robust to missing data, outliers and large rotations. For each method, the 
estimated registration parameters, that is the 3D translations $(t_x, t_y, t_z)$ and
rotations $(\theta_x, \theta_y, \theta_z)$, were compared to the true ones to determine the accuracy
of the registration. Tables 2 and 3 show the mean, the standard deviation, the
median and maximum of the registration errors for the different techniques. As 
can be seen the proposed SDM approach leads to a registration accuracy which 
is close to the two other methods. 
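
The summary statistics reported for this protocol are straightforward to compute once the estimated and true parameters are available. The sketch below only illustrates the bookkeeping: the error values are invented for the example, and the use of the sample standard deviation is an assumption, since the text does not say which estimator was used.

```python
import statistics

def absolute_errors(estimated, true):
    """Absolute difference between estimated and true parameter values."""
    return [abs(e - t) for e, t in zip(estimated, true)]

def summarize(errors):
    return {
        "mean": statistics.mean(errors),
        "std": statistics.stdev(errors),     # sample standard deviation (assumption)
        "median": statistics.median(errors),
        "max": max(errors),
    }

# Hypothetical x-translations (in voxels) for four synthetic transformations.
true_tx = [10.0, -5.0, 20.0, -15.0]
estimated_tx = [10.5, -6.0, 21.5, -13.0]
summary = summarize(absolute_errors(estimated_tx, true_tx))
```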

Fig. 4. MRI/SPECT registration using the deformable model. (a) MRI and SPECT
volumes before registration. The SPECT contours are superimposed onto the MRI to
illustrate the misalignment. (b) Parameterization of the head structure and
estimation of the brain surface of the MR image in (a) using the statistically constrained
deformable model. (c) Registration of the SPECT image to the part of the statistical
model describing the brain surface. (d) MRI and SPECT volumes after registration.
The registered SPECT image contours are superimposed onto the MRI to illustrate the
alignment of the two images.

Table 2. Multimodal registration of 3D MRI/SPECT images. A 3D SPECT image
volume manually pre-registered by an expert to its MRI counterpart was artificially
transformed using 25 different translation and rotation parameters. The average and
the standard deviation of the registration errors are presented for the different methods.
Translation errors are given in voxels and rotation errors in degrees.

3D MRI/SPECT Registration Errors (μ ± σ)

          MI             RIU            SDM
Δt_x      1.33 ± 1.16    0.47 ± 0.41    0.89 ± 0.43
Δt_y      1.61 ± 1.06    1.13 ± 0.90    0.86 ± 0.88
Δt_z      1.06 ± 1.19    1.08 ± 0.74    1.05 ± 1.02
Δθ_x      1.26 ± 1.09    0.75 ± 0.56    1.15 ± 1.11
Δθ_y      1.60 ± 0.92    0.58 ± 0.44    1.28 ± 0.87
Δθ_z      0.99 ± 0.86    1.04 ± 0.78    1.29 ± 0.67

Table 3. Multimodal registration of 3D MRI/SPECT images. A 3D SPECT image
volume manually pre-registered by an expert to its MRI counterpart was artificially
transformed using 25 different translation and rotation parameters. The median and
maximum registration errors for the rigid transformation parameters are presented.
See text for technique abbreviations.

3D MRI/SPECT Registration Errors

               MI      RIU     SDM
median(Δt)     1.35    0.63    0.54
maximum(Δt)    4.24    3.05    2.63
median(Δθ)     1.14    0.52    1.09
maximum(Δθ)    4.35    2.47    3.52

5 Conclusion and Future Prospects

We have presented a physically-based 3D statistical deformable model embedding
information on the spatial relationships and anatomical variability of multiple
anatomical structures, as observed over a representative population. The
particular model developed in this paper was devoted to head and brain
representation. Applications of this model included the registration of multimodal
image pairs (MRI/SPECT) and the unsupervised segmentation of the brain
structure from a given modality (MRI). The major advantage of statistical models
is that they naturally introduce a priori statistical knowledge that provides
useful constraints for ill-posed image processing tasks, such as image segmentation.
Consequently they are less affected by noise, missing data or outliers. As
an example, the statistical deformable model was applied to the segmentation
of the brain structure from post-operative images, in which missing anatomical
structures lead standard voxel-based techniques to erroneous segmentations.
The registration of multimodal brain images was also handled without performing
any preprocessing to remove non-brain structures.

One perspective of our work is to extend the model by representing other
anatomical structures of the brain (ventricles, corpus callosum, hippocampus,
etc.). The statistical deformable model presented in this paper may be considered
as a first step towards the development of a general purpose probabilistic
anatomical atlas of the brain.
Abstract. This paper presents a new method to find minimal paths in
3D images, given one or two endpoints as initial data. This is based on
previous work [1] for extracting paths in 2D images using Fast Marching 
[4]. Our original contribution is to extend this technique to 3D, and give 
new improvements of the approach that are relevant in 2D as well as in 
3D. We also introduce several methods to reduce the computation cost 
and the user interaction. This work finds its motivation in the particular 
case of 3D medical images. We show that this technique can be efficiently 
applied to the problem of finding a centered path in tubular anatomical 
structures with minimum interactivity, and we apply it to path construc- 
tion for virtual endoscopy. Synthetic and real medical images are used 
to illustrate each contribution. 

Keywords: Deformable Models, Minimal paths, Level Set methods,
Medical image understanding, Eikonal Equation, Fast Marching.



1 Introduction 

In this paper we deal with the problem of finding a curve of interest in a 3D 
image. It is defined as a minimal path with respect to a Potential P. This 
potential is derived from the image data depending on which features we are 
looking for. With classical deformable models P|, extracting a path between two 
fixed extremities is the solution of the minimization of an energy composed of 
internal and external constraints on this path, needing a precise initialization. 
Similarly, defining a cost function as an image constraint only, the minimal path 
becomes the path for which the integral of the cost between the two end points 
is minimal. Simplifying the model to external forces only, Cohen and Kimmel 
[1] solved this minimal path problem in 2D with a front propagation equation
between the two fixed end points, using the Eikonal equation (that physically 
models wavelight propagation), with a given initial front. Therefore, the first 
step is to build an image-based measure P that defines the minimality property 
in the studied image, and to introduce it in the Eikonal equation. The second 
step is to propagate the front on the entire image domain, starting from an 
initial front restricted to one of the fixed points. The propagation is done using 
an algorithm called Fast Marching [4].
The original contribution of our work is to adapt to 3D images the minimal
path technique developed in [1]. We also improve this technique by reducing the
computing cost of front propagation. For the particular case of tubular anato- 
mical structures, we also introduce a method to compute a path with a given 
length with only one point as initialization, and another method to extract a 
centered path in the object of interest. 

Deformable models have been widely used in medical imaging [7]. The main
motivation of this work is that it enables almost automatic path tracking routine 
in 3D medical images for virtual endoscopy inside an anatomical object. An 
endoscopy consists in threading a camera inside the patient’s body in order 
to examine a pathology. The virtual endoscopy process consists in rendering 
perspective views along a user-defined trajectory inside tubular structures of 
human anatomy with CT or MR 3D images. It is a non-invasive technique which 
is very useful for learning and preparing real examinations, and it can extract 
diagnostic elements from images. This new method skips the camera and can 
give views of region of the body difficult or impossible to reach physically (e.g. 
brain vessels). A major drawback in general remains when the user must define 
all path points manually. For a complex structure (small vessels, colon,...) the 
required interactivity can be very tedious. If the path is not correctly built, it 
can cross an anatomical wall during the virtual fly-through. 

Our work focuses on the automation of the path construction, reducing inter- 
actions and improving performance, given only one or two end points as inputs. 
We show that the Fast Marching method can be efficiently applied to the pro- 
blem of finding a path in virtual endoscopy with minimum interactivity. We also 
propose a range of choices for finding the right input potential P. 

In section 2 we summarize the method detailed in [1] for 2D images. In
section 3 we extend this method to 3D, and we detail each improvement made
on the front propagation technique. In section 4 we explain how to extract
centered paths in tubular structures. Finally, in section 5 we apply our method to
colon and brain vessels.

2 The Cohen-Kimmel Method in 2D 

2.1 Global Minimum for Active Contours. 

We present in this section the basic ideas of the method introduced by Cohen
and Kimmel (see [1] for details) to find the global minimum of the active contour
energy using minimal paths. The energy to minimize is similar to classical defor- 
mable models (see 0) where it combines smoothing terms and image features 
attraction term (Potential P): 

$$E(C) = \int_\Omega \left\{ w_1 \|C'(s)\|^2 + w_2 \|C''(s)\|^2 + P(C(s)) \right\} ds \qquad (1)$$

where $C(s)$ represents a curve drawn on a 2D image, $\Omega$ is its domain of definition
$[0, L]$, and $L$ is the length of the curve. It reduces the user initialization to giving
the two end points of the contour $C$. In [1], the authors have related this problem
to the new paradigm of the level-set formulation. In particular, its Euler equation
is equivalent to the geodesic active contours jSj . They introduced a model which 
improves energy minimization because the problem is transformed in a way to
find the global minimum, avoiding getting stuck in local minima. Most of the
classical deformable contours have no constraint on the parameterization $s$, thus
allowing different parameterizations of the contour $C$ to lead to different results.
In [1], contrary to the classical snake model (but similarly to geodesic active
contours), $s$ represents the arc-length parameter. Considering a simplified energy
model without a second derivative term leads to the expression

$$E(C) = \int_\Omega \{ w + P(C(s)) \}\, ds \qquad (2)$$

We now have an expression in which the internal forces are included in the
external potential. The regularization is now achieved by the constant $w > 0$.
Given a potential $P > 0$ that takes lower values near desired features, we are
looking for paths along which the integral of $\tilde{P} = P + w$ is minimal. We can
define the surface of minimal action $U$ as the minimal energy integrated along
a path between a starting point $p_0$ and any point $p$:

$$U(p) = \inf_{\mathcal{A}_{p_0,p}} E(C) = \inf_{\mathcal{A}_{p_0,p}} \left\{ \int_\Omega \tilde{P}(C(s))\, ds \right\} \qquad (3)$$

where $\mathcal{A}_{p_0,p}$ is the set of all paths between $p_0$ and $p$. The minimal path between
$p_0$ and any point $p_1$ in the image can be easily deduced from this action map.
Assuming that potential $P$ is always positive, the action map will have only one
local minimum which is the starting point $p_0$, and the minimal path will be found
by a simple back-propagation on the energy map. Thus, contour initialization is
reduced to the selection of the two extremities of the path.



2.2 Fast Marching Resolution. 

In order to compute this map $U$, a front-propagation equation related to
equation (3) is solved: $\frac{\partial C}{\partial t} = \frac{1}{\tilde{P}} \vec{n}$. It evolves a front starting from an infinitesimal
circle shape around $p_0$ until each point inside the image domain is assigned a
value for $U$. The value of $U(p)$ is the time $t$ at which the front passes over the
point $p$. It thus gives the minimal path energy needed to reach the start point from
any point in the image.

The fast marching technique, introduced by Sethian (see [4]), was used by Cohen
and Kimmel [1], noticing that the map $U$ satisfies the Eikonal equation:

$$\|\nabla U\| = \tilde{P} \qquad (4)$$
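
As a quick sanity check of the Eikonal equation, which is not part of the original derivation, consider the special case of a constant potential, that is $P + w \equiv c > 0$ over the whole image. The minimal path from $p_0$ to $p$ is then the straight segment, and the action map and its gradient are:

$$U(p) = c\,\|p - p_0\|, \qquad \nabla U(p) = c\,\frac{p - p_0}{\|p - p_0\|}, \qquad \|\nabla U(p)\| = c \quad (p \neq p_0).$$

The level sets of $U$ are circles centered on $p_0$, which matches the picture of a front expanding from the start point at constant speed $1/c$.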

Classic finite difference schemes for this equation tend to overshoot and are
unstable. Sethian [4] has proposed a method which relies on a one-sided derivative
that looks in the up-wind direction of the moving front, and thereby avoids the
over-shooting of finite differences. At each pixel $(i,j)$ the unknown $u$ satisfies:

$$(\max\{u - U_{i-1,j},\, u - U_{i+1,j},\, 0\})^2 + (\max\{u - U_{i,j-1},\, u - U_{i,j+1},\, 0\})^2 = \tilde{P}_{i,j}^2 \qquad (5)$$

giving the correct viscosity-solution $u$ for $U_{i,j}$. The improvement made by the
Fast Marching is to introduce order in the selection of the grid points. This order
is based on the fact that information is propagating outward, because action can
only grow due to the quadratic equation (5).

The algorithm is detailed in 3D in the next section in table 2. The fast marching
technique selects at each iteration the Trial point with minimum action value.
This technique of considering at each step only the necessary set of grid points
was originally introduced for the construction of minimum length paths in a
graph between two given nodes.

Thus it needs only one pass over the image. To perform these operations
efficiently, the Trial points are stored in a min-heap data structure
(see details in [4]). Since the complexity of the operation of changing the value of
one element of the heap is bounded by a worst-case bottom-to-top traversal of
the tree in $O(\log_2 N)$, the total work is about $O(N \log_2 N)$ for the fast marching
on an $N$ points grid.

3 3D Minimal Path Extraction

We are interested in this paper in finding a curve in a 3D image. The application
that motivates this problem is detailed in section 5. It can also have many other
applications. Our approach is to extend the minimal path method of the previous
section to finding a path $C(s)$ in a 3D image minimizing the energy:

$$E(C) = \int_\Omega \tilde{P}(C(s))\, ds \qquad (6)$$

where $\Omega = [0, L]$, $L$ being the length of the curve. We first extend the Fast
Marching method to 3D to compute the minimal action $U$. We then introduce
different improvements for finding the path of minimal action between two points
in 2D as well as in 3D. In the examples that illustrate the approach, we see various
ways of defining the potential $P$.

3.1 3D Fast-Marching

Similarly to the previous section, the minimal action $U$ is defined as

$$U(p) = \inf_{\mathcal{A}_{p_0,p}} \left\{ \int_\Omega \tilde{P}(C(s))\, ds \right\} \qquad (7)$$

where $\mathcal{A}_{p_0,p}$ is now the set of all 3D paths between $p_0$ and $p$. Given a start point
$p_0$, in order to compute $U$ we start from an initial infinitesimal front around $p_0$.
The 2D scheme of equation (5) developed in [4] is extended to 3D, leading to:

$$\begin{aligned} (\max\{u - U_{i-1,j,k},\, u - U_{i+1,j,k},\, 0\})^2 &+ \\ (\max\{u - U_{i,j-1,k},\, u - U_{i,j+1,k},\, 0\})^2 &+ \\ (\max\{u - U_{i,j,k-1},\, u - U_{i,j,k+1},\, 0\})^2 &= \tilde{P}_{i,j,k}^2 \end{aligned} \qquad (8)$$

giving the correct viscosity-solution $u$ for $U_{i,j,k}$. Considering the neighbors of
grid point $(i,j,k)$ in 6-connexity, we study the solution of equation (8) in
table 1.



Algorithm for 3D Up-Wind Scheme

We note $\{A_1, A_2\}$, $\{B_1, B_2\}$ and $\{C_1, C_2\}$ the three couples of opposite neighbors of
$(i,j,k)$ with the ordering $U_{A_1} \le U_{A_2}$, $U_{B_1} \le U_{B_2}$, $U_{C_1} \le U_{C_2}$, and
$U_{A_1} \le U_{B_1} \le U_{C_1}$. Three different cases are to be examined sequentially:

1. Considering that we have $u \ge U_{C_1} \ge U_{B_1} \ge U_{A_1}$, the equation derived is

$$(u - U_{A_1})^2 + (u - U_{B_1})^2 + (u - U_{C_1})^2 = \tilde{P}^2 \qquad (9)$$

Computing the discriminant $\Delta_1$ of equation (9) we have two cases
- If $\Delta_1 \ge 0$, $u$ should be the largest solution of equation (9);
    - If the hypothesis $u > U_{C_1}$ is wrong, go to 2;
    - If this value is larger than $U_{C_1}$, go to 4;
- If $\Delta_1 < 0$, it means that at least $C_1$ has an action too large to influence
the solution and that the hypothesis $u > U_{C_1}$ is false. Go to 2;

2. Considering that $u \ge U_{B_1} \ge U_{A_1}$ and $u < U_{C_1}$, the equation derived is

$$(u - U_{A_1})^2 + (u - U_{B_1})^2 = \tilde{P}^2 \qquad (10)$$

Computing the discriminant $\Delta_2$ of equation (10) we have two cases
- If $\Delta_2 \ge 0$, $u$ should be the largest solution of equation (10);
    - If the hypothesis $u > U_{B_1}$ is wrong, go to 3;
    - If this value is larger than $U_{B_1}$, go to 4;
- If $\Delta_2 < 0$, $B_1$ has an action too large to influence the solution. It means
that $u > U_{B_1}$ is false. Go to 3;

3. Considering that $u < U_{B_1}$ and $u \ge U_{A_1}$, we finally have $u = U_{A_1} + \tilde{P}$. Go to 4;

4. Return $u$.

Table 1. Solving locally the upwind scheme
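
The three cases of Table 1 translate almost line by line into code. The following is a minimal sketch under the assumption that each argument is the smaller action of a pair of opposite neighbors, with `math.inf` standing for a direction in which no value is available; the function name and this convention are illustrative, not taken from the paper.

```python
import math

def solve_upwind_3d(u_a1, u_b1, u_c1, p_tilde):
    """Local solution u of the 3D upwind scheme for one grid point."""
    a, b, c = sorted((u_a1, u_b1, u_c1))      # enforce U_A1 <= U_B1 <= U_C1

    # Case 1: three-term equation 3u^2 - 2(a+b+c)u + (a^2+b^2+c^2 - P^2) = 0.
    if c < math.inf:
        s = a + b + c
        reduced_disc = s * s - 3.0 * (a * a + b * b + c * c - p_tilde ** 2)
        if reduced_disc >= 0.0:
            u = (s + math.sqrt(reduced_disc)) / 3.0   # largest root
            if u > c:                                  # hypothesis u > U_C1 holds
                return u

    # Case 2: two-term equation 2u^2 - 2(a+b)u + (a^2+b^2 - P^2) = 0.
    if b < math.inf:
        s = a + b
        reduced_disc = s * s - 2.0 * (a * a + b * b - p_tilde ** 2)
        if reduced_disc >= 0.0:
            u = (s + math.sqrt(reduced_disc)) / 2.0
            if u > b:                                  # hypothesis u > U_B1 holds
                return u

    # Case 3: only the smallest neighbor influences the solution.
    return a + p_tilde

u_three = solve_upwind_3d(0.0, 0.0, 0.0, 1.0)    # all three directions contribute
u_two = solve_upwind_3d(0.0, 0.0, 10.0, 1.0)     # third neighbor is too large
u_one = solve_upwind_3d(0.0, 5.0, 10.0, 1.0)     # only one neighbor contributes
```

The code tests the reduced discriminant (one quarter of the usual one), which has the same sign and keeps the root formula compact.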

We extend the Fast Marching method, introduced in [4] and used by Cohen
and Kimmel [1], to our 3D problem. The algorithm is detailed in table 2.

3.2 Several Minimal Path Extraction Techniques 

In this section, different procedures to obtain the minimal path between two
points are detailed. After discussing the previous backpropagation method, we
study how we can limit the front propagation to a subset of the image domain, for
speeding-up execution. We illustrate the ideas of this section on two synthetic
examples of 3D front propagation in figures 1 and 2. To make the following
ideas easier to understand, we show examples in 2D in this section. Examples of
minimal paths in 3D real images are presented for the application in Section 5.

Algorithm for 3D Fast Marching

- Definition:
    - Alive is the set of all grid points at which the action value has been reached
      and will not be changed;
    - Trial is the set of next grid points to be examined and for which an estimate
      of $U$ has been computed using the algorithm of Table 1;
    - Far is the set of all other grid points, for which there is not yet an estimate
      for $U$;
- Initialization:
    - Alive set is confined to the starting point $p_0$;
    - Trial is the initial front, confined to the neighbors of $p_0$;
    - Far is the set of all other grid points;
- Loop:
    - Let $(i_{min}, j_{min}, k_{min})$ be the Trial point with the smallest action $U$;
    - Move it from the Trial to the Alive set (i.e. $U_{i_{min}, j_{min}, k_{min}}$ is frozen);
    - For each neighbor $(i,j,k)$ (6-connexity in 3D) of $(i_{min}, j_{min}, k_{min})$:
        - If $(i,j,k)$ is Far, add it to the Trial set and compute $U$ using table 1;
        - If $(i,j,k)$ is Trial, recompute the action $U_{i,j,k}$, and update it if the new
          value computed is smaller.

Table 2. Fast marching algorithm
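
The fast marching loop of Table 2 can be sketched with Python's `heapq` module. This is a hedged illustration rather than the authors' implementation: instead of changing the value of an element inside the heap, it pushes a new entry and skips stale ones when they are popped, the start point is pushed with action 0 rather than seeding its neighbors explicitly, only Alive neighbors feed the local update, and the function names and nested-list layout of the potential are assumptions.

```python
import heapq
import math

def solve_upwind(a, b, c, p):
    """Largest admissible root of the upwind scheme (cases of Table 1)."""
    a, b, c = sorted((a, b, c))
    for values in ((a, b, c), (a, b)):
        if values[-1] == math.inf:
            continue
        n = len(values)
        s = sum(values)
        reduced_disc = s * s - n * (sum(v * v for v in values) - p * p)
        if reduced_disc >= 0.0:
            u = (s + math.sqrt(reduced_disc)) / n
            if u > values[-1]:
                return u
    return a + p

def fast_marching_3d(potential, start):
    """Minimal action map on a 3D grid; potential[i][j][k] must be positive."""
    shape = (len(potential), len(potential[0]), len(potential[0][0]))
    axes = ((1, 0, 0), (0, 1, 0), (0, 0, 1))
    action = {start: 0.0}          # Trial and Alive values; Far points are absent
    alive = set()
    heap = [(0.0, start)]

    def frozen(pt):
        return action[pt] if pt in alive else math.inf

    while heap:
        _, pt = heapq.heappop(heap)
        if pt in alive:
            continue                # stale heap entry, point already frozen
        alive.add(pt)
        for axis in axes:
            for sign in (1, -1):
                nb = tuple(pt[d] + sign * axis[d] for d in range(3))
                if nb in alive or not all(0 <= nb[d] < shape[d] for d in range(3)):
                    continue
                # Smallest frozen action among the two opposite neighbors per axis.
                mins = [min(frozen(tuple(nb[d] - e[d] for d in range(3))),
                            frozen(tuple(nb[d] + e[d] for d in range(3))))
                        for e in axes]
                u = solve_upwind(mins[0], mins[1], mins[2],
                                 potential[nb[0]][nb[1]][nb[2]])
                if u < action.get(nb, math.inf):
                    action[nb] = u
                    heapq.heappush(heap, (u, nb))
    return action

# Hypothetical constant potential on a 5 x 5 x 5 grid.
potential = [[[1.0] * 5 for _ in range(5)] for _ in range(5)]
action = fast_marching_3d(potential, (0, 0, 0))
```

Pushing duplicates keeps the code short at the price of a slightly larger heap; the asymptotic cost remains of the same order as the bound quoted in the text.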



Minimal path by back-propagation. The minimal action map $U$ computed 
according to the discretization scheme of equation (?) is similar to a convex 
function, in the sense that its only local minimum is the global minimum found at 
the front propagation start point $p_0$, where $U(p_0) = 0$. The gradient of $U$ is 
orthogonal to the propagating fronts since these are its level sets. Therefore, the 
minimal action path between any point $p$ and the start point $p_0$ is found by sliding 
back along the map $U$ until it converges to $p_0$. This can be done with a simple 
steepest gradient descent, with a predefined descent step, on the minimal action map 
$U$, choosing $p_{n+1} = p_n - \text{step} \times \nabla U(p_n)$. Figure 1-middle shows the action 
map corresponding to a binarized potential defined by high values in a spiral rendered 
in figure 1-middle. The path found between a point in the center of the spiral and 
another point outside is shown in figure 1-right by transparency. 
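
The back-propagation step can be illustrated with a simplified discrete sketch. Instead of the continuous update with a fixed descent step, the code below moves at each iteration to the 8-connected neighbour with the smallest action; this substitution, the function name `backtrack`, and the toy action map (Euclidean distance to the origin, i.e. P = 1) are assumptions made for illustration only.

```python
import math


def backtrack(U, end, start):
    """Follow decreasing action values from end back to start.

    U is a 2D nested list of action values whose only local minimum is
    at start. Each step moves to the 8-connected neighbour with the
    smallest action, a discrete substitute for p - step * grad U(p).
    """
    rows, cols = len(U), len(U[0])
    path = [end]
    current = end
    while current != start:
        i, j = current
        best = current
        for di in (-1, 0, 1):
            for dj in (-1, 0, 1):
                ni, nj = i + di, j + dj
                inside = 0 <= ni < rows and 0 <= nj < cols
                if inside and U[ni][nj] < U[best[0]][best[1]]:
                    best = (ni, nj)
        if best == current:
            raise ValueError("stuck in a local minimum that is not the start point")
        current = best
        path.append(current)
    return path


# Illustrative action map: Euclidean distance to (0, 0).
U = [[math.hypot(i, j) for j in range(6)] for i in range(6)]
path = backtrack(U, end=(5, 3), start=(0, 0))
```

Because the action strictly decreases along the path, the loop terminates whenever the start point is the only local minimum of the map, which is the property the text relies on.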



Partial front propagation. An important issue concerning the back-propa- 
gation technique is to constrain the computations to the necessary set of pixels 
for one path construction. Finding several paths inside an image from the same 
seed point is an interesting task, but in the case we have two fixed extremities 
as input for the path construction, it is not necessary to propagate the front on 
Panels: potential P = spire; action map with P = spire; 3D path in the spire 



Fig. 1. Examples on synthetic potentials 

all the image domain, thus saving computing time. Figure 2 shows a test 
on an angiographic image of brain vessels. We can see that there is no need to 
propagate further than the points examined in figure 2-right, the path found being 
exactly the same as in figure 2-middle, where front propagation is done on all 
the image domain. We used a potential $P(x) = |\nabla G_\sigma * I(x)| + w$, where $I$ is the 
original image ($512^2$ pixels, displayed in figure 2-left), $G_\sigma$ a Gaussian filter of 
variance $\sigma = 2$, and $w = 1$ the weight of the model. In figure 2-right, the partial 
front propagation has visited less than 35% of the image. This ratio depends 
mainly on the length of the path tracked. 




Fig. 2. Comparing complete front propagation with partial front propagation method 
on a digital subtracted angiography (DSA) image 



Simultaneous partial front propagation. The idea is to propagate simultaneously 
a front from each end point $p_0$ and $p_1$. Let us consider the first grid point 
$p$ where those fronts collide. Since during propagation the action can only grow, 
propagation can be stopped at this step. Adjoining the two paths, respectively 
between $p_0$ and $p$, and $p_1$ and $p$, gives an approximation of the exact minimal 
action path between $p_0$ and $p_1$. Since $p$ is a grid point, the exact minimal 
path might not go through it, but through its neighborhood. Basically, there exists a 
real point $p^*$, whose nearest neighbor on the Cartesian grid is $p$, which belongs to 
the minimal path. Therefore, the approximation error is sub-pixel and there is no need 
to propagate further. 

It has two interesting benefits for front propagation: 

— It allows a parallel implementation of the algorithm, dedicating a processor 
to each propagation; 

— It decreases the number of pixels examined during a partial propagation by a factor of 

  — $\frac{R^2}{2 (R/2)^2} = 2$ in 2D (figure 3-right); 

  — $\frac{R^3}{2 (R/2)^3} = 4$ in 3D (figure 3-left), 

where $R$ is the distance between the two end points, because with the potential 
$P = 1$, the action map is the Euclidean distance. 

Note that it can also compute the Euclidean distance to a set of points by 
initializing P to be 0 at these points. 
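
The factors 2 and 4 quoted above follow from comparing the regions swept by the fronts. As a sketch, assume $P = 1$, end points a distance $R$ apart, and fronts that are not clipped by the image border (assumptions introduced here for illustration):

```latex
\begin{align*}
\text{2D:}\quad & \frac{\pi R^2}{2\,\pi (R/2)^2} = \frac{R^2}{2\,(R/2)^2} = 2, \\
\text{3D:}\quad & \frac{\tfrac{4}{3}\pi R^3}{2 \cdot \tfrac{4}{3}\pi (R/2)^3} = \frac{R^3}{2\,(R/2)^3} = 4.
\end{align*}
```

A single front must grow to radius $R$ before it reaches the other end point, whereas two colliding fronts each stop at radius $R/2$.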

Figure 4 displays a test on a digital subtracted angiography (DSA) image of brain 
vessels. The potential used is $P(x) = |I(x) - C| + w$, where $I$ is the original image 
($256^2$ pixels, displayed in figure 4-left), $C$ a constant term (mean value of the 
start and end points gray levels), and $w = 10$ the weight of the model. In figure 
4-middle, the partial front propagation has visited up to 60% of the image. With 
the colliding fronts method, only 30% of the image is visited (see figure 4-right), 
and the difference between both paths found is sub-pixel. 




Fig. 3. Propagation with potential P = 1 



One end point propagation. We have shown the ability of the front propagation 
techniques to compute the minimal path between two fixed points. In some 
cases, only one point should be necessary, or the user interaction needed for 
setting a second point is too tedious in a 3D image. We have derived a method 
that builds a path given only one end point and a maximum path length. The 
technique is similar to that of subsection 3.2, but the new condition is to 
stop propagation when a path corresponding to a chosen Euclidean distance is 
extracted. A test of this path length condition is shown in figure 5, which is a 
DSA image of brain vessels. We have seen with figure 3-left that propagating a 
front with potential $P = 1$ computes the Euclidean distance to the start point. 

Fig. 4. Comparing the partial front propagation with the colliding fronts method on a 
DSA image 




The original image The minimal action The path length map 

Fig. 5. Computing the Euclidean path length simultaneously 



Therefore, we use simultaneously an image-based potential $P_1$ for building the 
minimal path and a potential $P_2 = 1$ for computing the path length. 

While we are propagating the front corresponding to $P_1$ on the image domain, 
at each point $p$ examined we compute both minimal actions for $P_1$ (shown in 
figure 5-middle) and for $P_2$ (shown in figure 5-right). In this case the action 
corresponding to $P_2$ is an approximate Euclidean length of the minimal path 
between $p$ and $p_0$. 



4 The Path Centering Method 

In this section we derive a technique to track paths that are centered in a tubular 
shape, using the front propagation methods. To illustrate this problem, we use 
the example shown on figure 0- left, which is a binarized image of brain vessels. 
Using our classical front propagation, the minimal path extracted is tangential 
to the edges, as shown in figure 0middle, superimposed on the action map 
computed. This is due to the fact that length is minimized. This path is not 
tuned for problems which may require a centered path, and we will see in next 
section that it can be necessary for virtual endoscopy. In some cases it is possible 
to get the shape of the object in which we are looking for a path. One way of 

The two paths        The thresholded potential        The centered minimal action 

Fig. 6. Comparing classic and centered paths 



making this shape available is to use the front propagation itself as shown in 
Figure El This is more detailed in Ej. If we have the shape of our object, we 
can use a front propagation method to compute the distance to its edges using 
a potential defined by 

$$P(i,j) = 1 \quad \forall (i,j) \in \{\text{Object}\},$$ 

$$P(i,j) = \infty \quad \forall (i,j) \in \{\text{Background}\},$$ 

$$P(i,j) = 0 \quad \forall (i,j) \in \{\text{Interface}\}.$$ 

When this distance map, noted $\mathcal{E}$, is computed, it is used to create a potential $P'$ 
which weights the points in order to propagate a new front faster in the centre of 
the desired regions. Choosing a value $d$ to be the minimum acceptable distance 
to the walls, we propose the following potential: 

$$P'(x) = |d - \min(\mathcal{E}(x); d)|^\gamma \quad \text{with } \gamma > 1. \qquad (11)$$ 

According to this new penalty, the final front propagates faster in the center of 
the vessel. This can be observed by looking at the shape of the iso-action lines 
of the centered minimal action shown in figure 6-right. Finally, one can observe 
in figure 6-left that the path avoids the edges and remains in the center of the 
vessel, while the former path is tangential to the edges. This method can be related 
to robotic problems like optimal path planning (see Pj for details), essentially 
because the potential shown in figure Elleft is binary. But there is no reason to 
limit the application of this algorithm to a binary domain. Thus, for continuously 
varying potential $P$, we use the same method. In section 5 we present results on 
real 3D data applied to virtual endoscopy, where the problem is to find shortest 
paths on weighted domains. 
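
The centering potential of equation (11) is easy to evaluate once the distance map is available. The sketch below assumes a 2D nested list of distances to the walls; the function name `centering_potential` and the default exponent `gamma=2.0` are illustrative choices, not values taken from the paper.

```python
def centering_potential(dist, d, gamma=2.0):
    """Evaluate P'(x) = |d - min(E(x), d)| ** gamma on a 2D distance map.

    dist  : nested list of distances E(x) to the object walls
    d     : minimum acceptable distance to the walls
    gamma : exponent, assumed > 1 (2.0 is an arbitrary illustrative default)
    """
    return [[abs(d - min(e, d)) ** gamma for e in row] for row in dist]


# Points closer to the wall than d are penalized; points at distance >= d are not.
example = centering_potential([[0, 1, 3, 5]], d=3, gamma=2.0)
```

The penalty vanishes wherever the distance to the walls is at least `d`, so the front travels fastest along the central part of the tubular shape.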

5 Application to Virtual Endoscopy 

In previous sections we have developed a series of issues in front propagation 
techniques. We now study the particular case of virtual endoscopy, where 
extraction of paths in 3D images is a very tedious task. 

5.1 The Role of Virtual Endoscopy 

Visualization of volumetric medical image data plays a crucial part for diagnosis 
and therapy planning. The better the anatomy and the pathology are under- 
stood, the more efficiently one can operate with low risk. Different possibilities 
exist for visualizing 3D data: three 2D orthogonal views (see figure 7), maximum 
intensity projection (MIP, and its variants), surface and volume rendering. 
In particular, virtual endoscopy allows, by means of surface/volume rendering 




Fig. 7. Three orthogonal views of a volumetric CT data set of the colon 

techniques to visually inspect regions of the body that are dangerous and/or 
impossible to reach physically with a camera. A virtual endoscopic system is 
usually composed of two parts: 

1. A path construction part, which provides the successive locations of the 
fly-through of the tubular structure of interest (see figure 8-left); 

2. Three dimensional viewing along the endoscopic path (see figure 8-right). 




Original CT slice + path        Endoscopic view 

Fig. 8. Interior view of a colon, reconstructed from a defined path 




A major drawback in general remains when the path construction is left 
to the user, who has to “guide” the virtual endoscope/camera manually. The 
required interactivity on a 3D image can be very tedious for complex structures 
such as the colon. Since anatomical objects often have complex topologies, 
the path passes in and out of the three orthogonal planes. Consequently the right 
location is reached by alternately entering the projection of the wanted 
point in each of the three planes. Then, the path is approximated between the 
user-defined points by lines or Bezier splines, and if the number of points is not 
sufficient, it can easily cross an anatomical wall. Path construction in 3D images 
is thus a very critical task, and precise anatomical knowledge of the structure is 
needed to set a suitable trajectory with the minimum required interactivity. 

Numerous techniques [?] try to automate this path construction process 
by using a skeletonization technique as a pre-processing step. This first requires 
segmenting the object in order to binarize the image; the skeleton of this volume 
is then extracted. The skeleton often consists of many discontinuous trajectories, 
and post-processing is necessary to isolate and smooth the final path. But those 
methods can lead to critical cases: if there is a stenosis in the tubular structure, 
the binarization can produce two separate objects, for which a skeletonization is 
ineffective. The front propagation techniques studied in this paper propose an 
alternative to the tedious manual path construction by building paths in 3D images 
with minimum interactivity. In contrast to other methods, they do not require 
any pre- or post-processing. We first apply this method to the case of virtual 
endoscopy in a colon CT dataset, then we extend it to a brain MR dataset. 

5.2 Application to Colonoscopy 

All tests are performed on a volumetric CT scan of size 512 x 512 x 140 voxels, 
shown in figure 7. We define a potential $P$ from the 3D image $I(x)$ that is 
minimal inside the anatomical shapes where end points are located. We chose the 
potential $P(x) = |I(x) - I_{mean}|^\alpha + w$, where an average grey level value $I_{mean}$ 
of the colon is obtained with a histogram. From this definition, $P$ is lower inside 
the colon in order to propagate the front faster. Also, edges are enhanced with a 
non-linear function ($\alpha > 1$) since the path to be extracted is in a large object that 
has a complex shape and very thin edges. Then, using this potential, we propagate 
inside the colon creating a path between a couple of given points. In fact, the 
colon being a closed object with two extremities, one can use the Euclidean 
path length stopping criterion as explained in subsection El This allows to 
give only one end point. The figure 0 shows the result of the fast marching 
technique with a unique starting point belonging to the colon and an Euclidean 
path length criterion of 500 mm. This path has been computed in 10 seconds (in 
CPU time) on an UltraSparc 30 with a 300 MHz monoprocessor. However, this 
potential does not produce paths relevant for virtual endoscopy. Indeed, paths 
should remain not only in the anatomical object of interest but as far as possible 
from its edges. In order to achieve this target, we use the centering potential 
method as detailed in section 4. This approach needs shape information. This 
information is provided by the previous front propagation. From its definition, 
the front sticks to the anatomical shapes as shown in figure |3 This is related to 
the use of Fast Marching algorithm to extract a surface for segmentation P|. It 
gives a rough segmentation of the colon and provides a good information and a 
fast-reinitialization technique to compute the distance to the edges. Using this 
thresholded map as a potential that indicates the distance to the walls, we can 

Fig. 9. Successive steps of front propagation inside the colon volume 




The two different paths Image potential Centering potential 



Fig. 10. Centering the path in the colon 

correct the initial path as shown in figure 10-left. Both 3D paths are projected 
on the 2D slice for visualization. As expected, the new path remains more in the 
middle of the colon. The two different cross-sections in figures 10-middle and 
10-right display the view of the interior of the colon from both paths at the u-turn 
shown in figure 10-left. This effect of centering the path dramatically enhances 
the rendering of the video sequence of virtual endoscopy obtained.^1 With the 
initial potential, the path is near the wall, and we see the u-turn, whereas with 
the new path, the view is centered in the colon, giving a more correct view of 
the inside of the colon. 

Therefore, the two end points can be connected correctly, giving a path staying 
inside the anatomical object. The results are displayed in two 3D views in 
figure 11. But for virtual colonoscopy, it is often not necessary to set the two 
end points within the anatomical object. 



^1 This video will be shown at the presentation, and is available at 
http://www.ceremade.dauphine.fr/~cohen/ECCV00. 

Fig. 11. 3D Views of a path inside the colon 

5.3 Application to a Brain MRA Image 

Tests were performed on brain vessels in a magnetic resonance angiography 
(MRA) scan. The problem is different, because there is only signal from blood. 
All other structures have been removed. The main difficulty here lies in the 
variations of the dye intensity. The path, shown from two viewpoints (see 
figure 12), tracks the superior sagittal venous canal, using a nonlinear function of 
the image dye intensity ($P(x) = |I(x) - 100|^2 + 1$). 




Fig. 12. Path tracking in brain vessels in a MR-Angiographic volume. 



6 Conclusion 

In this paper we presented a fast and efficient algorithm that computes a 3D path 
of minimal energy. This is particularly useful in medical image understanding 
for guiding endoscopic viewing. 

This work was the extension to 3D of a level-set technique developed in [?] 
for extracting paths in 2D images, given only the two extremities of the path 
and the image as inputs, with a front propagation equation. 

We improved this front propagation equation by creating new algorithms 
which decrease the minimal path extraction computing cost, and reduce user 
interaction in the case of path tracking inside tubular structures. We showed that 
those techniques can be efficiently applied to the problem of finding a path in 
tubular anatomical structures for virtual endoscopy with minimum interactivity. 
In particular we extracted centered paths inside a CT dataset of the colon, and 
in an MR dataset of the brain vessels. We have shown the benefit of our method 
over manual path construction and skeletonization techniques, showing that 
only a few seconds are necessary to build a complete trajectory inside the body, 
giving only one or two end points and the image as inputs. 

Abstract. In this paper, we study general questions about the solvability 
of the Kruppa equations and show that, in several special cases, the 
Kruppa equations can be renormalized and become linear. In particular, 
for cases when the camera motion is such that its rotation axis is par- 
allel or perpendicular to translation, we can obtain linear algorithms for 
self-calibration. A further study of these cases not only reveals generic 
difficulties with degeneracy in conventional self-calibration methods ba- 
sed on the nonlinear Kruppa equations, but also clarifies some incomplete 
discussion in the literature about the solutions of the Kruppa equations. 
We demonstrate that Kruppa equations do not provide sufficient con- 
straints on camera calibration and give a complete account of exactly 
what is missing in Kruppa equations. In particular, a clear relationship 
between the Kruppa equations and chirality is revealed. The results then 
resolve the discrepancy between the Kruppa equations and the necessary 
and sufficient condition for a unique calibration. Simulation results are 
presented for evaluation of the sensitivity and robustness of the proposed 
linear algorithms. 



Keywords: Camera self-calibration, Kruppa equations, renormalization, dege- 
neracy, chirality. 

1 Introduction 

The problem of camera self-calibration refers to the problem of obtaining intrin- 
sic parameters of a camera using only information from image measurements, 
without any a priori knowledge about the motion between frames and the struc- 
ture of the observed scene. The original question of determining whether the 
image measurements only are sufficient for obtaining intrinsic parameters of a 
camera was initially answered in [?]. The proposed approach and solution 
utilize invariant properties of the image of the so called absolute conic. Since the 
absolute conic is invariant under Euclidean transformations (i.e., its representation 
is independent of the position of the camera) and depends only on the 
camera intrinsic parameters, the recovery of the image of the absolute conic is 

* This work is supported by ARO under the MURI grant DAAH04-96-1-0341. 

then equivalent to the recovery of the camera intrinsic parameter matrix. The 
constraints on the absolute conic are captured by the so called Kruppa equations 
initially discovered by Kruppa in 1913. In Section 3 we will provide a much 
more concise derivation of the Kruppa equations. 

Certain algebraic and numerical approaches for solving the Kruppa equations 
were first discussed in [?]. Some alternative and additional schemes have been 
explored in [?]. Nevertheless, it has been well-known that, in the presence of 
noise, these Kruppa equation based approaches are not guaranteed to provide a 
good estimate of the camera calibration and many erroneous solutions will occur 
[?]. Because of this, we decide to revisit the Kruppa equation based approach in 
this paper. More specifically, we address the following two questions: 

1. Under what conditions do the Kruppa equations become degenerate or ill- 
conditioned? 

2. When conditions for degeneracy are satisfied, how do the self-calibration 
algorithms need to be modified? 

In this paper, we show that the answer to the former question is rather unfor- 
tunate: for camera motions such that the rotation axis is parallel or perpendicu- 
lar to the translation, the Kruppa equations become degenerate. This explains 
why conventional approaches to self-calibration based on the (nonlinear) Kruppa 
equations often fail. Most practical images are, in fact, taken through motions 
close to these two types. The parallel case shows up very frequently in the motion 
of aerial mobile robots such as a helicopter. The perpendicular case is interesting 
in robot navigation, where the main rotations of the on-board camera are 
yaw and pitch, whose axes are perpendicular to the direction of robot heading. 
Nevertheless, in this paper, we take one step further to show that when such 
motions occur, the corresponding Kruppa equations can be renormalized and 
become linear. This fact allows us to correct (or salvage) classical Kruppa equation 
based self-calibration algorithms so as to obtain much more stable linear 
self-calibration algorithms, other than the pure rotation case known to Hartley 
[?]. Our study also clarifies and completes previous analysis and results in the 
literature regarding the solutions of the Kruppa equations [?]. This is discussed 
in Section ISl 

Relations to Previous Works: Besides the Kruppa equation based self-calibration 
approach, alternative methods have also been studied extensively. 
For example some of them use the so called absolute quadric constraints 
[?], modulus constraints [?] and chirality constraints [?]. Some others 
restrict to special cases such as stationary camera [?] or to time-varying focal 
length [?]. We hope that, by a more detailed study of the Kruppa equations, 
we may gain a better understanding of the relationships among the various self- 
calibration methods. This is discussed in Section rO 

2 Epipolar Geometry Basics 

To introduce the notation, we first review in this section the well-known epipolar 
geometry and some properties of fundamental matrix to aid the derivation and 
study of Kruppa equations. 

The camera motion is represented by $(R, p)$ where $R$ is a rotation matrix as 
an element in the special orthogonal group $SO(3)$ and $p \in \mathbb{R}^3$ is a three dimensional 
vector representing the translation of the camera. That is, $(R, p)$ represents 
a rigid body motion as an element in the special Euclidean group $SE(3)$. The 
three dimensional coordinates (with respect to the camera frame) of a generic 
point $q$ in the world are related by the following Euclidean transformation: 

$$q(t_2) = R(t_2, t_1)\, q(t_1) + p(t_2, t_1), \quad \forall t_1, t_2 \in \mathbb{R}. \qquad (1)$$ 

We use the matrix $A \in \mathbb{R}^{3\times 3}$ to represent the intrinsic parameters of the 
camera, which we also refer to as the calibration matrix of the camera. In 
this paper, without loss of generality, we will assume $\det(A) = 1$, i.e., $A$ is 
an element in the special linear group $SL(3)$. $SL(3)$ is the group consisting of 
$3 \times 3$ real matrices with determinant equal to 1. This choice of $A$ is slightly 
different from (and more general than) the traditional choice in the literature, 
but, mathematically, it is more natural to deal with. Then the (uncalibrated) 
image $x$ (on the image plane in $\mathbb{R}^3$) of the point $q$ at time $t$ is given through 
the following equation: 

$$\lambda(t)\, x(t) = A\, q(t), \quad \forall t \in \mathbb{R}, \qquad (2)$$ 

where $\lambda(t) \in \mathbb{R}$ is a scalar encoding the depth of the point $q$. Note that this 
model does not differentiate between spherical and perspective projection. 

Since we primarily consider the two-view case in this paper, to simplify the 
notation, we will drop the time dependency from the motion $(R(t_2, t_1), p(t_2, t_1))$ 
and simply denote it as $(R, p)$, and also use $x_1, x_2$ as shorthand for $x(t_1), x(t_2)$ 
respectively. Also, for a three dimensional vector $p \in \mathbb{R}^3$, we can always associate 
to it a skew symmetric matrix $\hat{p} \in \mathbb{R}^{3\times 3}$ such that $p \times q = \hat{p} q$ for all $q \in \mathbb{R}^3$.^1 
Then it is well known that the two image points $x_1$ and $x_2$ must satisfy the 
so called epipolar constraint: 

$$x_1^T A^{-T} R^T \hat{p} A^{-1} x_2 = 0. \qquad (3)$$ 

The matrix $F = A^{-T} R^T \hat{p} A^{-1} \in \mathbb{R}^{3\times 3}$ is the so called fundamental matrix 
in Computer Vision literature. When $A = I$, the fundamental matrix simply 
becomes $R^T \hat{p}$, which is called the essential matrix and plays a very important role 
in motion recovery [?]. The following simple but extremely useful lemma will 
allow us to write the fundamental matrix in a more convenient form: 

Lemma 1 (Hat Operator). If $p \in \mathbb{R}^3$ and $A \in SL(3)$, then $A^T \hat{p} A = \widehat{A^{-1} p}$. 

Proof: Since both $A^T \widehat{(\cdot)} A$ and $\widehat{A^{-1}(\cdot)}$ are linear maps from $\mathbb{R}^3$ to $\mathbb{R}^{3\times 3}$, using the 
fact that $\det(A) = 1$, one may directly verify that these two linear maps are equal on 
the bases: $(1, 0, 0)^T$, $(0, 1, 0)^T$ or $(0, 0, 1)^T$. ■ 
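
Lemma 1 can also be checked numerically. The following sketch uses plain Python lists for 3x3 matrices; the helper names (`hat`, `matmul`, `transpose`, `matvec`, `hat_lemma_gap`) and the sample matrix are illustrative assumptions. To avoid inverting $A$, the check picks $q$, sets $p = Aq$, and compares $A^T \hat{p} A$ with $\hat{q}$.

```python
def hat(p):
    """Skew-symmetric matrix such that hat(p) q equals the cross product p x q."""
    x, y, z = p
    return [[0.0, -z, y], [z, 0.0, -x], [-y, x, 0.0]]


def matmul(X, Y):
    """Product of two 3x3 matrices stored as nested lists."""
    return [[sum(X[i][k] * Y[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def transpose(X):
    return [list(row) for row in zip(*X)]


def matvec(X, v):
    return [sum(X[i][k] * v[k] for k in range(3)) for i in range(3)]


def hat_lemma_gap(A, q):
    """Largest entry of |A^T hat(A q) A - hat(q)|; zero when det(A) = 1."""
    p = matvec(A, q)  # chosen so that q = A^{-1} p
    lhs = matmul(matmul(transpose(A), hat(p)), A)
    rhs = hat(q)
    return max(abs(lhs[i][j] - rhs[i][j]) for i in range(3) for j in range(3))


# Upper-triangular sample with det(A) = 2 * 0.5 * 1 = 1, so A is in SL(3).
A = [[2.0, 1.0, 3.0], [0.0, 0.5, 1.0], [0.0, 0.0, 1.0]]
gap = hat_lemma_gap(A, [1.0, 2.0, 3.0])

# A matrix with det = 2 violates the identity, showing the role of det(A) = 1.
gap_bad = hat_lemma_gap([[2.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]],
                        [1.0, 2.0, 3.0])
```

For a general invertible matrix the left-hand side picks up a factor $\det(A)$, which is why the second example leaves a non-zero gap.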

^1 In the computer vision literature, such a skew symmetric matrix is also often denoted 
as $p_\times$. But we here use the notation consistent with robotics and matrix Lie group 
theory, where $\hat{p}$ is used to denote elements in the Lie algebra $so(3)$ of $SO(3)$. 

This simple lemma will be frequently used throughout the paper. By this lemma, 
we have: 

$$F = A^{-T} R^T \hat{p} A^{-1} = A^{-T} R^T A^T A^{-T} \hat{p} A^{-1} = A^{-T} R^T A^T \hat{p'} \qquad (4)$$ 

where $p' = Ap \in \mathbb{R}^3$ is the so called epipole. This equation in fact has a 
more fundamental interpretation: an uncalibrated camera in a calibrated world 
is mathematically equivalent to a calibrated camera in an uncalibrated world 
(for more details see [?]). As we will soon see, the last form of the fundamental 
matrix in the above equation is the most useful one for deriving and solving the 
Kruppa equations. 

3 The Kruppa Equations 

Without loss of generality, we may assume that both the rotation $R$ and translation 
$p$ are non-trivial, i.e., $R \neq I$ and $p \neq 0$; hence the epipolar constraint 
is not degenerate and the fundamental matrix can be estimated. The camera 
self-calibration problem is then reduced to recovering the symmetric matrix 
$\omega = A^{-T} A^{-1}$ or $\omega^{-1} = A A^T$ from fundamental matrices. It can be shown that, 
even if we have chosen $A$ to be an arbitrary element in $SL(3)$, $A$ can only be 
recovered up to a rotation, i.e., as an element in the quotient space $SL(3)/SO(3)$; 
for more details see [?]. Note that $SL(3)/SO(3)$ is only a 5-dimensional space. 
From the fundamental matrix, the epipole vector $p'$ can be directly computed 
(up to an arbitrary scale) as the null space of $F$. Given a fundamental matrix 
$F = A^{-T} R^T A^T \hat{p'}$, its scale, usually denoted as $\lambda$, is defined as the norm of $p'$. 
If $\lambda = \|p'\| = 1$, such an $F$ is called a normalized fundamental matrix.^2 For 
now, we assume that the fundamental matrix $F$ happens to be normalized. 

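Suppose the standard basis of $\mathbb{R}^3$ is $e_1 = (1, 0, 0)^T$, $e_2 = (0, 1, 0)^T$, $e_3 = 
(0, 0, 1)^T \in \mathbb{R}^3$. Now pick any rotation matrix $R_0 \in SO(3)$ such that $R_0 p' = e_3$. 
Using Lemma 1, we have $\hat{p'} = R_0^T \hat{e}_3 R_0$. Define matrix $D \in \mathbb{R}^{3\times 3}$ to be: 

$$D = F R_0^T = A^{-T} R^T A^T R_0^T \hat{e}_3 = A^{-T} R^T A^T R_0^T (e_2, -e_1, 0). \qquad (5)$$ 

Then $D$ has the form $D = (\xi_1, \xi_2, 0)$ with $\xi_1, \xi_2 \in \mathbb{R}^3$ being the first and second 
column vectors of $D$. We then have $\xi_1 = A^{-T} R^T A^T R_0^T e_2$, $\xi_2 = -A^{-T} R^T A^T R_0^T e_1$. 
Define vectors $\eta_1, \eta_2 \in \mathbb{R}^3$ as $\eta_1 = -R_0^T e_1$, $\eta_2 = R_0^T e_2$, then it is direct to 
check that $\omega^{-1}$ satisfies: 

$$\xi_1^T \omega^{-1} \xi_1 = \eta_2^T \omega^{-1} \eta_2, \quad \xi_2^T \omega^{-1} \xi_2 = \eta_1^T \omega^{-1} \eta_1, \quad \xi_1^T \omega^{-1} \xi_2 = \eta_1^T \omega^{-1} \eta_2. \qquad (6)$$ 

We thus obtain three homogeneous constraints on the matrix $\omega^{-1}$, the inverse 
(dual) of the matrix (conic) $\omega$. These constraints can be used to compute $\omega^{-1}$, 
hence $\omega$. 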

The above derivation is based on the assumption that the fundamental matrix 
$F$ is normalized, i.e., $\|p'\| = 1$. However, since the epipolar constraint is homogeneous 
in the fundamental matrix $F$, it can only be determined up to an arbitrary 
scale. Suppose $\lambda$ is the length of the vector $p' \in \mathbb{R}^3$ in $F = A^{-T} R^T A^T \hat{p'}$. 
Consequently, the vectors $\xi_1$ and $\xi_2$ are also scaled by the same $\lambda$. Then the ratio 
between the left and right hand side quantities in each equation of (6) is equal 
to $\lambda^2$. This gives two equations on $\omega^{-1}$, the so called Kruppa equations (after 
their initial discovery by Kruppa in 1913): 

$$\lambda^2 = \frac{\xi_1^T \omega^{-1} \xi_1}{\eta_2^T \omega^{-1} \eta_2} = \frac{\xi_2^T \omega^{-1} \xi_2}{\eta_1^T \omega^{-1} \eta_1} = \frac{\xi_1^T \omega^{-1} \xi_2}{\eta_1^T \omega^{-1} \eta_2}. \qquad (7)$$ 

^2 Here $\|\cdot\|$ represents the standard 2-norm. 




Alternative means of obtaining the Kruppa equations are by utilizing algebraic 
relationships between projective geometric quantities [?] or via SVD 
characterization of $F$ [?]. Here we obtain the same equations from a quite different 
approach. Equation (7) further reveals the geometric meaning of the Kruppa 
ratio $\lambda^2$: it is the square of the length of the vector $p'$ in the fundamental matrix 
$F$. This discovery turns out to be quite useful when we later discuss the 
renormalization of Kruppa equations. In general, each fundamental matrix provides 
at most two algebraic constraints on $\omega^{-1}$, if the two equations in (7) happen to 
be independent. Since the symmetric matrix $\omega$ has five degrees of freedom, in 
general at least three fundamental matrices are needed to uniquely determine $\omega$. 
Nevertheless, as we will soon see, this is not the case for many special camera 
motions. 



Comment 1 One must be aware that solving Kruppa equations for camera calibration 
is not equivalent to the camera self-calibration problem in the sense that there may exist 
solutions of Kruppa equations which are not solutions of a “valid” self- calibration. 
Given a non-critical set of camera motions, the associated Kruppa equations do not 
necessarily give enough constraints to solve for the calibration matrix A. See Section 
rO for a complete account. 

The above derivation of Kruppa equations is straightforward, but the 
expression (7) depends on a particular choice of the rotation matrix $R_0$; note 
that such a choice is not unique. However, there is an even simpler way to get 
an equivalent expression for the Kruppa equations in a matrix form. Given a 
normalized fundamental matrix $F = A^{-T} R^T A^T \hat{p'}$, it is then straightforward to 
check that $\omega^{-1} = A A^T$ must satisfy the following equation: 

$$F^T \omega^{-1} F = \hat{p'}^T \omega^{-1} \hat{p'}. \qquad (8)$$ 
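
Equation (8) can be verified numerically for a sample calibration and motion. The sketch below is only an illustration: the helper names, the sample matrix $A$ (upper triangular with determinant 1), the rotation about the z-axis and the translation are arbitrary assumptions, and $F$ is built directly from its definition $F = A^{-T} R^T A^T \hat{p'}$ with $p'$ normalized to unit length.

```python
import math


def hat(p):
    """Skew-symmetric matrix such that hat(p) q equals p x q."""
    x, y, z = p
    return [[0.0, -z, y], [z, 0.0, -x], [-y, x, 0.0]]


def matmul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def transpose(X):
    return [list(row) for row in zip(*X)]


def matvec(X, v):
    return [sum(X[i][k] * v[k] for k in range(3)) for i in range(3)]


def cross(a, b):
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]


def inverse(X):
    """3x3 inverse via the adjugate (cross products of the rows)."""
    r1, r2, r3 = X
    c1, c2, c3 = cross(r2, r3), cross(r3, r1), cross(r1, r2)
    det = sum(r1[k] * c1[k] for k in range(3))
    return [[c1[i] / det, c2[i] / det, c3[i] / det] for i in range(3)]


def kruppa_residual(A, R, p):
    """Largest entry of |F^T w F - hat(p')^T w hat(p')| with w = A A^T."""
    p_prime = matvec(A, p)
    norm = math.sqrt(sum(c * c for c in p_prime))
    p_hat = hat([c / norm for c in p_prime])   # normalized epipole, |p'| = 1
    At = transpose(A)
    F = matmul(matmul(matmul(transpose(inverse(A)), transpose(R)), At), p_hat)
    omega_inv = matmul(A, At)
    lhs = matmul(matmul(transpose(F), omega_inv), F)
    rhs = matmul(matmul(transpose(p_hat), omega_inv), p_hat)
    return max(abs(lhs[i][j] - rhs[i][j]) for i in range(3) for j in range(3))


theta = 0.7  # arbitrary rotation angle about the z-axis
R = [[math.cos(theta), -math.sin(theta), 0.0],
     [math.sin(theta), math.cos(theta), 0.0],
     [0.0, 0.0, 1.0]]
A = [[2.0, 1.0, 3.0], [0.0, 0.5, 1.0], [0.0, 0.0, 1.0]]   # det(A) = 1
residual = kruppa_residual(A, R, [0.3, -1.2, 0.8])
```

The residual vanishes up to rounding because $F^T \omega^{-1} F = \hat{p'}^T A R R^T A^T \hat{p'}$ and $R R^T = I$.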

We call this equation the normalized matrix Kruppa equation. It is readily 
seen that this equation is equivalent to (6). If $F$ is not normalized and is scaled by 
$\lambda \in \mathbb{R}$, i.e., $F = \lambda A^{-T} R^T A^T \hat{p'}$,^3 we then have the matrix Kruppa equation: 

$$F^T \omega^{-1} F = \lambda^2 \hat{p'}^T \omega^{-1} \hat{p'}. \qquad (9)$$ 

This equation is equivalent to the scalar version given by (7) and is independent 
of the choice of the rotation matrix $R_0$. In fact, the matrix form reveals that the 
nature of Kruppa equations is nothing but inner product invariants of the 
group $A\, SO(3)\, A^{-1}$ (for more details see [?]). 

^3 Without loss of generality, from now on, we always assume $\|p'\| = 1$. 

3.1 Solving Kruppa Equations 

Algebraic properties of Kruppa equations have been extensively studied (see e.g. 
[?]). However, conditions on dependency among Kruppa equations obtained 
from the fundamental matrix have not been fully discovered. Therefore it is hard 
to tell in practice whether a given set of Kruppa equations suffices to guarantee a 
unique solution for calibration. As we will soon see in this section, for very rich 
classes of camera motions which commonly occur in many practical applications, 
the Kruppa equations will become degenerate. Moreover, since the Kruppa 
equations (7) or (9) are highly nonlinear in $\omega^{-1}$, most self-calibration algorithms 
based on directly solving these equations suffer from being computationally 
expensive or having multiple local minima [?]. These reasons have motivated us 
to study the geometric nature of Kruppa equations in order to gain a better 
understanding of the difficulties commonly encountered in camera self-calibration. 
Our attempt to resolve these difficulties will lead to simplified algorithms for 
self-calibration. These algorithms are linear and better conditioned for these special 
classes of camera motions. 

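Given a fundamental matrix $F = A^{-T} R^T A^T \hat{p'}$ with $p'$ of unit length, the 
normalized matrix Kruppa equation (8) can be rewritten in the following way: 

$$\hat{p'}^T (\omega^{-1} - A R A^{-1} \omega^{-1} A^{-T} R^T A^T) \hat{p'} = 0. \qquad (10)$$ 

According to this form, if we define $C = A^{-T} R^T A^T$, a linear (Lyapunov) map 
$\sigma : \mathbb{R}^{3\times 3} \to \mathbb{R}^{3\times 3}$ as $\sigma : X \mapsto X - C^T X C$, and a linear map $\tau : \mathbb{R}^{3\times 3} \to \mathbb{R}^{3\times 3}$ as 
$\tau : Y \mapsto \hat{p'}^T Y \hat{p'}$, then the solution of equation (10) is exactly the (symmetric 
real) kernel of the composition map: 

$$\tau \circ \sigma : \mathbb{R}^{3\times 3} \xrightarrow{\sigma} \mathbb{R}^{3\times 3} \xrightarrow{\tau} \mathbb{R}^{3\times 3}.$$ 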

This interpretation of Kruppa equations clearly decomposes the effects of the 
rotational and translational parts of the motion: if there is no translation, i.e., $p = 0$, 
then there is no map $\tau$; if the translation is non-zero, the kernel is enlarged due 
to the composition with the map $\tau$. In general, the symmetric real kernel of the 
composition map $\tau \circ \sigma$ is 3 dimensional, while the kernel of $\sigma$ is only 
2 dimensional (see [?]). The solutions for the unnormalized Kruppa equation are much 
more complicated due to the unknown scale $\lambda$. However, we have the following 
lemma to simplify things a little bit. 
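
The kernel dimensions quoted above can be checked numerically. The following NumPy sketch is only an illustration and not part of the original analysis: the calibration matrix `A`, the rotation, the translation, and all helper names are hypothetical choices. It represents $\sigma$ and $\tau \circ \sigma$ on a basis of the symmetric $3\times3$ matrices and counts the kernel dimension from the numerical rank.

```python
import numpy as np

def hat(p):
    """Skew-symmetric matrix such that hat(p) @ x == np.cross(p, x)."""
    return np.array([[0.0, -p[2], p[1]],
                     [p[2], 0.0, -p[0]],
                     [-p[1], p[0], 0.0]])

def rotation(axis, angle):
    """Rodrigues formula for the rotation about the given axis."""
    K = hat(np.asarray(axis, dtype=float) / np.linalg.norm(axis))
    return np.eye(3) + np.sin(angle) * K + (1.0 - np.cos(angle)) * (K @ K)

def symmetric_basis():
    """The six elementary basis matrices of the real symmetric 3x3 matrices."""
    basis = []
    for i in range(3):
        for j in range(i, 3):
            B = np.zeros((3, 3))
            B[i, j] = B[j, i] = 1.0
            basis.append(B)
    return basis

def symmetric_kernel_dim(linear_map, tol=1e-9):
    """Kernel dimension of a linear map restricted to symmetric matrices."""
    M = np.stack([linear_map(B).ravel() for B in symmetric_basis()], axis=1)
    return 6 - np.linalg.matrix_rank(M, tol=tol)

# Hypothetical calibration and motion, chosen only for illustration.
A = np.array([[2.0, 0.1, 0.5],
              [0.0, 1.8, 0.4],
              [0.0, 0.0, 1.0]])
R = rotation([0.0, 0.0, 1.0], 0.6)
p = np.array([1.0, 0.3, 0.5])      # neither parallel nor perpendicular to the axis
p_prime = A @ p
P = hat(p_prime / np.linalg.norm(p_prime))

C = np.linalg.inv(A).T @ R.T @ A.T  # C = A^{-T} R^T A^T


def sigma(X):
    """Lyapunov map X -> X - C^T X C."""
    return X - C.T @ X @ C


def tau(Y):
    """Map Y -> hat(p')^T Y hat(p')."""
    return P.T @ Y @ P


dim_sigma = symmetric_kernel_dim(sigma)
dim_tau_sigma = symmetric_kernel_dim(lambda X: tau(sigma(X)))
print(dim_sigma, dim_tau_sigma)
```

For this generic motion the two counts should come out as 2 and 3, matching the statement in the text. The rank tolerance is a pragmatic choice for noise-free data.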

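Lemma 2. Given a fundamental matrix $F = A^{-T} R^T A^T \hat{p}'$ with $p' = Ap$, a real 
symmetric matrix $X \in \mathbb{R}^{3\times3}$ is a solution of 
$F^T X F = \lambda^2 \hat{p}'^T X \hat{p}'$ if and only if $Y = A^{-1} X A^{-T}$ is a 
solution of $E^T Y E = \lambda^2 \hat{p}^T Y \hat{p}$ with $E = R^T \hat{p}$. 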

Using Lemma 1, the proof of this lemma is simply algebraic. This simple lemma, 
however, states a very important fact: given a set of fundamental matrices 
$F_i = A^{-T} R_i^T A^T \hat{p}_i'$ with $p_i' = A p_i$, $i = 1, \ldots, n$, there is a 
one-to-one correspondence between the set of solutions of the equations 
$F_i^T X F_i = \lambda_i^2 \hat{p}_i'^T X \hat{p}_i'$, $i = 1, \ldots, n$, 
and the set of solutions of the equations 
$E_i^T Y E_i = \lambda_i^2 \hat{p}_i^T Y \hat{p}_i$, $i = 1, \ldots, n$, where 
$E_i = R_i^T \hat{p}_i$ are essential matrices associated to the given fundamental matrices. 
Note that these essential matrices are determined only by the camera motion. 
Therefore, the conditions of uniqueness of the solution of Kruppa equations only 
depend on the camera motion. Our next task is then to study how the solutions 
of Kruppa equations depend on the camera motion. 



3.2 Renormalization and Degeneracy of Kruppa Equations 

From the derivation of the Kruppa equations, we observe that the 
reason why they are nonlinear is that we do not usually know the scale $\lambda$. It is 
then helpful to know under what conditions the matrix Kruppa equation will 
have the same solutions as the normalized one, i.e., with $\lambda$ set to 1. Here we will 
study two special cases for which we are able to know directly what the missing $\lambda$ 
is. The fundamental matrix can then be renormalized and we can therefore solve 
the camera calibration from the normalized matrix Kruppa equations, which are 
linear! These two cases are when the rotation axis is parallel or perpendicular 
to the translation. That is, if the motion is represented by $(R,p) \in SE(3)$ and 
the unit vector $u \in \mathbb{R}^3$ is the axis of $R$, then the two cases are when $u$ is 
parallel or perpendicular to $p$. As we will soon see, these two cases are of great 
theoretical importance: not only does the calibration algorithm become linear, 
but it also reveals certain subtleties of the Kruppa equations and explains when 
the nonlinear Kruppa equations are most likely to become ill-conditioned. 

Lemma 3. Consider a camera motion $(R,p) \in SE(3)$ where $R = e^{\hat{u}\theta}$, 
$\theta \in (0, \pi)$ and the axis $u \in \mathbb{R}^3$ is parallel or perpendicular to $p$. 
If $\gamma \in \mathbb{R}$ and a positive definite matrix $Y$ are a solution to the matrix 
Kruppa equation $\hat{p}^T R Y R^T \hat{p} = \gamma^2 \hat{p}^T Y \hat{p}$ 
associated to the essential matrix $R^T \hat{p}$, then we must have $\gamma^2 = 1$. Consequently, 
$Y$ is a solution of the normalized matrix Kruppa equation $\hat{p}^T R Y R^T \hat{p} = \hat{p}^T Y \hat{p}$. 

Proof: Without loss of generality we assume $\|p\| = 1$. For the parallel case, let 
$x \in \mathbb{R}^3$ be a vector of unit length in the plane spanned by the column vectors 
of $\hat{p}$. All such $x$ lie on a unit circle. There exists $x_0 \in \mathbb{R}^3$ on the 
circle such that $x_0^T Y x_0$ is maximum. We then have 
$x_0^T R Y R^T x_0 = \gamma^2 x_0^T Y x_0$, hence $\gamma^2 \le 1$. Similarly, if we pick 
$x_0$ such that $x_0^T Y x_0$ is minimum, we have $\gamma^2 \ge 1$. Therefore, $\gamma^2 = 1$. 
For the perpendicular case, since the columns of $\hat{p}$ span the subspace which is 
perpendicular to the vector $p$, the eigenvector $u$ of $R$ is in this subspace. Thus we have: 
$u^T R Y R^T u = \gamma^2 u^T Y u \Rightarrow u^T Y u = \gamma^2 u^T Y u$. 
Hence $\gamma^2 = 1$ if $Y$ is positive definite. ■ 

Combining Lemma 2 and Lemma 3 we immediately have: 

Theorem 1 (Renormalization of Kruppa Equations). Consider an unnormalized 
fundamental matrix $F = A^{-T} R^T A^T \hat{p}'$ where $R = e^{\hat{u}\theta}$, 
$\theta \in (0, \pi)$ and the axis $u \in \mathbb{R}^3$ is parallel or perpendicular to 
$p = A^{-1}p'$. Let $e = p'/\|p'\|$. Then if $\lambda \in \mathbb{R}$ and a positive definite 
matrix $\omega$ are a solution to the matrix Kruppa equation 
$F^T \omega^{-1} F = \lambda^2 \hat{e}^T \omega^{-1} \hat{e}$, we must have $\lambda^2 = \|p'\|^2$. 



$R$ can always be written in the form $R = e^{\hat{u}\theta}$ for some $\theta \in [0, \pi]$ and $u \in S^2$. 
This theorem claims that, for the two types of special motions considered here, 
there is no solution for $\lambda$ in the Kruppa equation (9) besides the true scale of the 
fundamental matrix. Hence we can decompose the problem into finding $\lambda$ first 
and then solving for $\omega$ or $\omega^{-1}$. The following theorem allows us to directly 
compute the scale $\lambda$ for a given fundamental matrix: 

Theorem 2 (Renormalization of Fundamental Matrix). Given an unnormalized 
fundamental matrix $F = \lambda A^{-T} R^T A^T \hat{p}'$ with $\|p'\| = 1$, if 
$p = A^{-1}p'$ is parallel to the axis of $R$, then $\lambda^2$ is $\|F \hat{p}' F^T\|$, 
and if $p$ is perpendicular to the axis of $R$, then $\lambda$ is one of the two non-zero 
eigenvalues of $F \hat{p}'^T$. 

Proof: Note that, since $\hat{p}'^T \hat{p}'$ is a projection matrix to the plane spanned 
by the column vectors of $\hat{p}'$, we have the identity $\hat{p}'^T \hat{p}' \hat{p}' = \hat{p}'$. 
First we prove the parallel case. It is straightforward to check that, in general, 
$F \hat{p}' F^T = \lambda^2 \widehat{A R^T p}$. Since the axis of $R$ is parallel to $p$, we have 
$R^T p = p$ so that $F \hat{p}' F^T = \lambda^2 \hat{p}'$. For the perpendicular 
case, let $u \in \mathbb{R}^3$ be the axis of $R$. By assumption $p = A^{-1}p'$ is perpendicular 
to $u$. Then there exists $v \in \mathbb{R}^3$ such that $u = \hat{p} A^{-1} v$. Then it is direct 
to check that $\hat{p}' v$ is the eigenvector of $F \hat{p}'^T$ corresponding to the eigenvalue $\lambda$. ■ 
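
Both statements of Theorem 2 can be checked on synthetic, noise-free data. The sketch below is an illustration under stated assumptions: the matrix `A`, the rotation, the scale `lam`, and the helper names are hypothetical, the matrix norm is taken to be the spectral norm, and the eigenvalue problem is posed for $F\hat{p}'^T$.

```python
import numpy as np

def hat(p):
    """Skew-symmetric matrix such that hat(p) @ x == np.cross(p, x)."""
    return np.array([[0.0, -p[2], p[1]],
                     [p[2], 0.0, -p[0]],
                     [-p[1], p[0], 0.0]])

def rotation(axis, angle):
    """Rodrigues formula for the rotation about the given axis."""
    K = hat(np.asarray(axis, dtype=float) / np.linalg.norm(axis))
    return np.eye(3) + np.sin(angle) * K + (1.0 - np.cos(angle)) * (K @ K)

def unnormalized_F(A, R, p, lam):
    """F = lam * A^{-T} R^T A^T hat(p'), with p' = A p rescaled to unit length."""
    p_prime = A @ p
    p_prime = p_prime / np.linalg.norm(p_prime)
    F = lam * np.linalg.inv(A).T @ R.T @ A.T @ hat(p_prime)
    return F, p_prime

def scale_parallel(F, p_prime):
    """lambda^2 as the spectral norm of F hat(p') F^T (parallel case)."""
    return np.linalg.norm(F @ hat(p_prime) @ F.T, 2)

def scale_candidates_perpendicular(F, p_prime):
    """The two non-zero eigenvalues of F hat(p')^T (perpendicular case)."""
    eig = np.linalg.eigvals(F @ hat(p_prime).T)
    order = np.argsort(np.abs(eig))
    return eig[order[1:]]          # drop the eigenvalue closest to zero

# Hypothetical calibration, rotation about Z, and scale.
A = np.array([[2.0, 0.1, 0.5],
              [0.0, 1.8, 0.4],
              [0.0, 0.0, 1.0]])
R = rotation([0.0, 0.0, 1.0], 0.6)
lam = 1.7

F_par, pp_par = unnormalized_F(A, R, np.array([0.0, 0.0, 2.0]), lam)
lam_sq = scale_parallel(F_par, pp_par)

F_perp, pp_perp = unnormalized_F(A, R, np.array([1.0, 0.0, 0.0]), lam)
candidates = scale_candidates_perpendicular(F_perp, pp_perp)
print(lam_sq, candidates)
```

In the perpendicular case the sketch returns both candidates, since the theorem alone does not say which of the two eigenvalues is the true scale.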

Then for these two types of special motions, the associated fundamental matrix 
can be immediately normalized by being divided by the scale $\lambda$. Once the 
fundamental matrices are normalized, the problem of finding the calibration matrix 
from the normalized matrix Kruppa equations (8) becomes a simple linear one! 
A normalized matrix Kruppa equation in general imposes three linearly independent 
constraints on the unknown calibration matrix. However, this 
is no longer the case for the special motions that we are considering here. 

Theorem 3 (Degeneracy of Kruppa Equations). Consider a camera motion 
$(R,p) \in SE(3)$ where $R = e^{\hat{u}\theta}$ has the angle $\theta \in (0, \pi)$. If the 
axis $u \in \mathbb{R}^3$ is parallel or perpendicular to $p$, then the normalized matrix 
Kruppa equation $\hat{p}^T R Y R^T \hat{p} = \hat{p}^T Y \hat{p}$ imposes only two linearly 
independent constraints on the symmetric matrix $Y$. 

Proof: For the parallel case, by restricting $Y$ to the plane spanned by the column 
vectors of $\hat{p}$, it is a symmetric matrix $\tilde{Y}$ in $\mathbb{R}^{2\times2}$. The 
rotation matrix $R \in SO(3)$ restricted to this plane is a rotation $\tilde{R} \in SO(2)$. 
The normalized matrix Kruppa equation is then equivalent to 
$\tilde{Y} - \tilde{R}\tilde{Y}\tilde{R}^T = 0$. Since $0 < \theta < \pi$, this equation imposes 
exactly two constraints on the three dimensional space of $2 \times 2$ real symmetric matrices. 
The identity $I_{2\times2}$ is the only solution (up to scale). Hence the normalized Kruppa 
equation imposes exactly two linearly independent constraints on $Y$. 

For the perpendicular case, since $u$ is in the plane spanned by the column vectors 
of $\hat{p}$, there exists $v \in \mathbb{R}^3$ such that $(u, v)$ form an orthonormal basis of 
the plane. Then the normalized matrix Kruppa equation is equivalent to: 

$$\hat{p}^T R Y R^T \hat{p} = \hat{p}^T Y \hat{p} \;\Leftrightarrow\; (u, v)^T R Y R^T (u, v) = (u, v)^T Y (u, v). \qquad (12)$$

Since $R^T u = u$, the above matrix equation is equivalent to the two equations 
$v^T R Y u = v^T Y u$ and $v^T R Y R^T v = v^T Y v$. These are the only two constraints 
given by the normalized Kruppa equation. ■ 
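
The constraint counts in Theorem 3 can also be verified numerically. The following sketch is illustrative only (the motions and helper names are hypothetical): it works in calibrated coordinates and counts the linearly independent constraints that the normalized Kruppa equation imposes on a symmetric $Y$ for a general, a parallel, and a perpendicular motion.

```python
import numpy as np

def hat(p):
    """Skew-symmetric matrix such that hat(p) @ x == np.cross(p, x)."""
    return np.array([[0.0, -p[2], p[1]],
                     [p[2], 0.0, -p[0]],
                     [-p[1], p[0], 0.0]])

def rotation(axis, angle):
    """Rodrigues formula for the rotation about the given axis."""
    K = hat(np.asarray(axis, dtype=float) / np.linalg.norm(axis))
    return np.eye(3) + np.sin(angle) * K + (1.0 - np.cos(angle)) * (K @ K)

def kruppa_constraint_count(R, p, tol=1e-9):
    """Rank of Y -> hat(p)^T (R Y R^T - Y) hat(p) on symmetric 3x3 matrices."""
    P = hat(p / np.linalg.norm(p))
    cols = []
    for i in range(3):
        for j in range(i, 3):
            B = np.zeros((3, 3))
            B[i, j] = B[j, i] = 1.0          # elementary symmetric basis matrix
            cols.append((P.T @ (R @ B @ R.T - B) @ P).ravel())
    return np.linalg.matrix_rank(np.stack(cols, axis=1), tol=tol)

R = rotation([0.0, 0.0, 1.0], 0.6)           # rotation about Z, angle in (0, pi)
n_general = kruppa_constraint_count(R, np.array([1.0, 0.3, 0.5]))
n_parallel = kruppa_constraint_count(R, np.array([0.0, 0.0, 1.0]))
n_perpendicular = kruppa_constraint_count(R, np.array([1.0, 0.0, 0.0]))
print(n_general, n_parallel, n_perpendicular)
```

With exact data the general motion should yield three constraints and each special motion two, so the symmetric solution space grows from three to four dimensions in the degenerate cases.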
According to this theorem, although we can renormalize the fundamental 
matrix when rotation axis and translation are parallel or perpendicular, we only 
get two independent constraints from the resulting (normalized) Kruppa equation 
corresponding to a single fundamental matrix. Hence for these motions, in 
general, we still need three such fundamental matrices to uniquely determine the 
unknown calibration. On the other hand, if we do not renormalize the fundamental 
matrix in these cases and directly use the unnormalized Kruppa equations 
to solve for calibration, the two nonlinear equations they provide are in fact 
algebraically dependent! Therefore, one can only get one constraint, as opposed to 
the expected two, on the unknown calibration. This is summarized in Table 1. 

Table 1. Dependency of Kruppa equation on angle $\phi \in [0, \pi)$ between the rotation 
and translation. 

| Cases | Type of Constraints | # of Constraints on $\omega^{-1}$ |
|---|---|---|
| ($\phi \ne 0$) and ($\phi \ne \pi/2$) | Unnormalized Kruppa Equation | 2 |
| | Normalized Kruppa Equation | 3 |
| ($\phi = 0$) or ($\phi = \pi/2$) | Unnormalized Kruppa Equation | 1 |
| | Normalized Kruppa Equation | 2 |


Although, mathematically, motions involving translation either parallel or 
perpendicular to the rotation are only a zero-measure subset of $SE(3)$, they are very 
commonly encountered in applications: many image sequences are taken 
by moving the camera around an object in a trajectory composed of planar 
motion or orbital motion, in which case the rotation axis and translation 
direction are likely perpendicular to each other. Another example is a so-called 
screw motion, whose rotation axis and translation are parallel. Such a motion 
shows up frequently in aerial mobile motion. Our analysis shows that, for 
these types of motions, even if the sufficient conditions for a unique calibration 
are satisfied, a self-calibration algorithm based on directly solving the Kruppa 
equations is likely to be ill-conditioned [?]. To intuitively demonstrate the 
practical significance of our results, we give an example in Figure 1. Our analysis 
reveals that in these cases, it is crucial to renormalize the Kruppa equation using 
Theorem 2: once the fundamental matrix or Kruppa equations are renormalized, 
not only is one more constraint recovered, but we also obtain linear (normalized 
Kruppa) equations. 
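
To make the resulting linear procedure concrete, here is a minimal sketch for the parallel (screw motion) case on noise-free synthetic data. It is an illustration, not the implementation used in the experiments: the function names, the test calibration `A_true`, the three motions, the use of the spectral norm, and the upper-triangular parametrization of $A$ with last diagonal entry 1 are all assumptions.

```python
import numpy as np

def hat(p):
    """Skew-symmetric matrix such that hat(p) @ x == np.cross(p, x)."""
    return np.array([[0.0, -p[2], p[1]],
                     [p[2], 0.0, -p[0]],
                     [-p[1], p[0], 0.0]])

def rotation(axis, angle):
    """Rodrigues formula for the rotation about the given axis."""
    K = hat(np.asarray(axis, dtype=float) / np.linalg.norm(axis))
    return np.eye(3) + np.sin(angle) * K + (1.0 - np.cos(angle)) * (K @ K)

def renormalize_parallel(F):
    """Return F / |lambda| and the unit epipole e, using lambda^2 = ||F hat(e) F^T||."""
    e = np.linalg.svd(F)[2][-1]                # F e = 0, so e = +/- p'
    return F / np.sqrt(np.linalg.norm(F @ hat(e) @ F.T, 2)), e

def calibrate_linear(F_list):
    """Stack the normalized Kruppa equations, solve for X = A A^T, and factor X."""
    blocks = []
    for F in F_list:
        Fn, e = renormalize_parallel(F)
        P = hat(e)
        cols = []
        for i in range(3):
            for j in range(i, 3):
                B = np.zeros((3, 3))
                B[i, j] = B[j, i] = 1.0
                cols.append((Fn.T @ B @ Fn - P.T @ B @ P).ravel())
        blocks.append(np.stack(cols, axis=1))  # 9 x 6 block per fundamental matrix
    x = np.linalg.svd(np.vstack(blocks))[2][-1]  # null vector of the stacked system
    X = np.array([[x[0], x[1], x[2]],
                  [x[1], x[3], x[4]],
                  [x[2], x[4], x[5]]])
    X = X / X[2, 2]                            # (A A^T)[2, 2] = 1 fixes scale and sign
    J = np.fliplr(np.eye(3))
    L = np.linalg.cholesky(J @ X @ J)          # lower-triangular factor of the flipped X
    return J @ L @ J                           # upper-triangular A with A A^T = X

# Hypothetical ground truth and three screw motions about X, Y and Z.
A_true = np.array([[2.0, 0.1, 0.5],
                   [0.0, 1.8, 0.4],
                   [0.0, 0.0, 1.0]])
F_list = []
for axis, angle, lam in [((1, 0, 0), 0.5, 1.3), ((0, 1, 0), 0.6, 0.7), ((0, 0, 1), 0.7, 2.1)]:
    u = np.array(axis, dtype=float)
    p_prime = A_true @ u                       # p = u is parallel to the rotation axis
    p_prime = p_prime / np.linalg.norm(p_prime)
    F_list.append(lam * np.linalg.inv(A_true).T @ rotation(u, angle).T @ A_true.T @ hat(p_prime))

A_est = calibrate_linear(F_list)
print(np.round(A_est, 6))
```

Each renormalized fundamental matrix contributes two independent constraints, so three motions with independent axes are used here. With noisy data the null vector would only be a least-squares estimate, and the Cholesky step could fail if the estimate is not positive definite.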

Comment 2 (Solutions of the Normalized Kruppa Equations) Claims 
of Theorem 3 run contrary to the claims of Propositions B.5 hence B.9 in [?]. In 
Proposition B.5 of [?] it is claimed that the solution space of the normalized Kruppa 
equations when the translation is parallel or perpendicular to the rotation axis is two 
or three dimensional. In Theorem 3, we claim that the solution space is always four 
dimensional. Theorem 3 does not cover the case when the rotation angle $\theta$ is $\pi$. 
However, if one allows the rotation to be $\pi$, the solutions of normalized Kruppa equations 
are even more complicated. For example, we know $e^{\hat{u}\pi}\hat{p} = -\hat{p}$ if $u$ is of 
unit length and parallel to $p$ (see [?]). Therefore, if $R = e^{\hat{u}\pi}$, the corresponding 
normalized Kruppa equation is completely degenerate and imposes no constraints at all on 
the calibration matrix. 

Fig. 1. Two consecutive orbital motions with independent rotations: even if pairwise 
fundamental matrices among the three views are considered, one only gets at most 
1 + 1 + 2 = 4 effective constraints on the camera intrinsic matrix if the three matrix 
Kruppa equations are not renormalized. After renormalization, however, we may get 
back to 2 + 2 + 2 ≥ 5 constraints. 

Comment 3 (Number of Solutions) Although Theorem 2 claims that for the 
perpendicular case $\lambda$ is one of the two non-zero eigenvalues of $F\hat{p}'^T$, unfortunately, 
there is no way to tell which one is the right one: simulations show that it could be 
either the larger or the smaller one. Therefore, in a numerical algorithm, for given $n \ge 3$ 
fundamental matrices, one needs to consider all possible $2^n$ combinations. According 
to Theorem 1, in the noise-free case, only one of the solutions can be positive definite, 
which corresponds to the true calibration. 

3.3 Kruppa Equations and Chirality 

It can be shown that if the scene is rich enough (in a sense to be made precise), then the 
necessary and sufficient condition for a unique camera calibration (see [?]) says that two 
general motions with rotation around different axes already determine a unique 
Euclidean solution for camera motion, calibration and scene structure. However, 
the two Kruppa equations obtained from these two motions will only give us at 
most four constraints on $\omega$, which is not enough to determine $\omega$, which has five 
degrees of freedom. We hence need to know what information is missing from the 
Kruppa equation. Stated alternatively, can we get extra independent constraints 
on $\omega$ from the fundamental matrix other than the Kruppa equation? 

The proof of Theorem 2 suggests another equation can be derived from the 
fundamental matrix $F = \lambda A^{-T} R^T A^T \hat{p}'$ with $\|p'\| = 1$. Since 
$F \hat{p}' F^T = \lambda^2 \widehat{A R^T p}$, we can obtain the vector 
$\alpha = \lambda^2 A R^T p = \lambda^2 A R^T A^{-1} p'$. Then it is obvious that 
the following equation for $\omega = A^{-T} A^{-1}$ holds: 

$$\alpha^T \omega \alpha = \lambda^4 p'^T \omega p'. \qquad (13)$$
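
Equation (13) is easy to confirm numerically. The sketch below is illustrative, with a hypothetical calibration, motion, and scale: it extracts $\alpha$ from the skew-symmetric matrix $F\hat{p}'F^T$ and compares both sides of the equation.

```python
import numpy as np

def hat(p):
    """Skew-symmetric matrix such that hat(p) @ x == np.cross(p, x)."""
    return np.array([[0.0, -p[2], p[1]],
                     [p[2], 0.0, -p[0]],
                     [-p[1], p[0], 0.0]])

def vee(S):
    """Inverse of hat: the vector p with hat(p) == S for skew-symmetric S."""
    return np.array([S[2, 1], S[0, 2], S[1, 0]])

def rotation(axis, angle):
    """Rodrigues formula for the rotation about the given axis."""
    K = hat(np.asarray(axis, dtype=float) / np.linalg.norm(axis))
    return np.eye(3) + np.sin(angle) * K + (1.0 - np.cos(angle)) * (K @ K)

# Hypothetical calibration, general motion, and scale.
A = np.array([[2.0, 0.1, 0.5],
              [0.0, 1.8, 0.4],
              [0.0, 0.0, 1.0]])
R = rotation([0.2, -0.5, 1.0], 0.8)
lam = 1.4
p_prime = np.array([0.3, -0.6, 0.7])
p_prime = p_prime / np.linalg.norm(p_prime)    # the text assumes ||p'|| = 1

A_inv = np.linalg.inv(A)
F = lam * A_inv.T @ R.T @ A.T @ hat(p_prime)

alpha = vee(F @ hat(p_prime) @ F.T)            # alpha = lam^2 A R^T A^{-1} p'
omega = A_inv.T @ A_inv

lhs = alpha @ omega @ alpha
rhs = lam ** 4 * (p_prime @ omega @ p_prime)
print(lhs, rhs)
```

The check only confirms that (13) holds for the true $\omega$ and $\lambda$; it says nothing about whether (13) is independent of the Kruppa equations.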
Notice that this is a constraint on $\omega$, not like the Kruppa equations which are 
constraints on $\omega^{-1}$. Combining the Kruppa equations with (13) we 
have: 

77^ w - i ?72 r]fuj-^r]i 

Is the last equation algebraically independent of the two Kruppa equations? 
Although it seems to be quite different from the Kruppa equations, it is in 
fact dependent on them. This can be shown either numerically or using simple 
algebraic tools such as Maple. Thus, it appears that our effort to look for extra 
independent, explicit constraints on $A$ from the fundamental matrix has failed. 
In the following, we will give an explanation for this by showing that not all $\omega$ 
which satisfy the Kruppa equations may give valid Euclidean reconstructions 
of both the camera motion and scene structure. The extra constraints which 
are missing in Kruppa equations are in fact captured by the so-called chirality 
constraint, which was previously studied in [5]. We now give a clear and concise 
description of the relationship between the Kruppa equations and chirality. 

Theorem 4 (Kruppa Equations and Chirality). Consider a camera with 
calibration matrix $I$ and motion $(R,p)$. If $p \ne 0$, among all the solutions 
$Y = A^{-1}A^{-T}$ of the Kruppa equation $E^T Y E = \lambda^2 \hat{p}^T Y \hat{p}$ 
associated to $E = R^T\hat{p}$, only those which guarantee $A R A^{-1} \in SO(3)$ may 
provide a valid Euclidean reconstruction of both camera motion and scene structure 
in the sense that any other solution pushes some plane $N \subset \mathbb{R}^3$ to the plane 
at infinity, and feature points on different sides of the plane $N$ have different signs 
of recovered depth. 

Proof: The images $x_2, x_1$ of any point $q \in \mathbb{R}^3$ satisfy the coordinate transformation: 

$$\lambda_2 x_2 = \lambda_1 R x_1 + p.$$

If there exists $Y = A^{-1}A^{-T}$ such that $E^T Y E = \lambda^2 \hat{p}^T Y \hat{p}$ for some 
$\lambda \in \mathbb{R}$, then the matrix $F = A^{-T} E A^{-1} = A^{-T} R^T A^T \hat{p}'$ is also an 
essential matrix with $p' = Ap$, that is, there exists $\tilde{R} \in SO(3)$ such that 
$F = \tilde{R}^T \hat{p}'$ (see [?] for an account of properties of essential matrices). Under the 
new calibration $A$, the coordinate transformation is in fact: 

$$\lambda_2 A x_2 = \lambda_1 A R A^{-1} (A x_1) + p'.$$

Since $F = \tilde{R}^T \hat{p}' = A^{-T} R^T A^T \hat{p}'$, we have 
$A R A^{-1} = \tilde{R} + p' v^T$ for some $v \in \mathbb{R}^3$. 
Then the above equation becomes 
$\lambda_2 A x_2 = \lambda_1 \tilde{R} (A x_1) + \lambda_1 p' v^T (A x_1) + p'$. Letting 
$\beta = \lambda_1 v^T (A x_1) \in \mathbb{R}$, we can further rewrite the equation as: 

$$\lambda_2 A x_2 = \lambda_1 \tilde{R} A x_1 + (\beta + 1) p'. \qquad (15)$$

Nevertheless, with respect to the solution $A$, the reconstructed images $A x_1, A x_2$ and 
$(\tilde{R}, p')$ must also satisfy: 

$$\gamma_2 A x_2 = \gamma_1 \tilde{R} A x_1 + p' \qquad (16)$$

Nevertheless, extra implicit constraints on $A$ may still be obtained from other 
algebraic facts. For example, the so-called modulus constraints give three implicit 
constraints on $A$ by introducing three extra unknowns; for more details see [?]. 



T 

T]{UJ ^T]2 



p'^ujp' 



(14) 
for some scale factors $\gamma_1, \gamma_2 \in \mathbb{R}$. Now we prove by contradiction that 
$v \ne 0$ is impossible for a valid Euclidean reconstruction. Suppose that $v \ne 0$ and we 
define the plane $N = \{q \in \mathbb{R}^3 \mid v^T q = -1\}$. Then for any 
$q = \lambda_1 A x_1 \in N$, we have $\beta = -1$. Hence, from (15), $A x_1, A x_2$ satisfy 
$\lambda_2 A x_2 = \lambda_1 \tilde{R} A x_1$. Since $A x_1, A x_2$ also satisfy (16) and 
$p' \ne 0$, both $\gamma_1$ and $\gamma_2$ in (16) must be $\infty$. That is, the plane $N$ is 
"pushed" to the plane at infinity by the solution $A$. For points not on the plane $N$, we have 
$\beta + 1 \ne 0$. Comparing the two equations (15) and (16), we get 
$\gamma_i = \lambda_i/(\beta + 1)$, $i = 1, 2$. Then for a point on the far side of the plane $N$, 
i.e., $\beta + 1 < 0$, the recovered depth scale $\gamma$ is negative; for a point on the near 
side of $N$, i.e., $\beta + 1 > 0$, the recovered depth scale $\gamma$ is positive. Thus, we must 
have that $v = 0$. ■ 

Comment 4 (Quasi-affine Reconstruction) Theorem 4 essentially implies the 
chirality constraints studied in [?]. According to the above theorem, if only finitely many 
feature points are measured, a solution of the calibration matrix $A$ which may allow a 
valid Euclidean reconstruction should induce a plane $N$ not cutting through the convex 
hull spanned by all the feature points and camera centers. Such a reconstruction is 
referred to as quasi-affine in [?]. 

It is known that, in general, all $A$'s which make $A R A^{-1}$ a rotation matrix 
form a one parameter family [?]. Thus, following Theorem 4, a camera calibration 
can be uniquely determined by two independent rotations regardless of translation 
if enough feature points are available. An intuitive example is provided in 
Figure 2. 




Fig. 2. A camera undergoes two motions $(R_1, p_1)$ and $(R_2, p_2)$ observing a rig consisting 
of three straight lines $L_1, L_2, L_3$. Then the camera calibration is uniquely determined 
as long as $R_1$ and $R_2$ have independent rotation axes and rotation angles in $(0, \pi)$, 
regardless of $p_1, p_2$. This is because, for any invalid solution $A$, the associated plane $N$ 
(see the proof of Theorem 4) must intersect the three lines at some point, say $q$. Then 
the reconstructed depth of point $q$ with respect to the solution $A$ would be infinite 
(points beyond the plane $N$ would have negative recovered depth). This gives us a 
criterion to exclude all such invalid solutions. 



The significance of Theorem 4 is that it explains why we get only two 
constraints from one fundamental matrix even in the two special cases when the 
Kruppa equations can be renormalized: extra ones are imposed by the structure, 
not the motion. The theorem also resolves the discrepancy between the 
Kruppa equations and the necessary and sufficient condition for a unique 
calibration: the Kruppa equations, although convenient to use, do not provide 
sufficient conditions for a valid calibration which allows a valid Euclidean 
reconstruction of both the camera motion and scene structure. However, the fact 
given in Theorem 4 is somewhat difficult to harness in algorithms. For example, 
in order to exclude invalid solutions, one needs feature points on or beyond 
the plane $N$. Alternatively, if such feature points are not available, one may 
first obtain a projective reconstruction and then use the so-called absolute 
quadric constraints to calibrate the camera [?]. However, in such a method, 
the camera motion needs to satisfy a stronger condition than requiring only two 
independent rotations, i.e., it cannot be critical in the sense specified in [?]. 



4 Simulation Results 



In this section, we test the performance of the proposed algorithms through 
different experiments. The error measure between the actual calibration matrix 
$A$ and the estimated calibration matrix $\tilde{A}$ was chosen to be 
$\text{error} = \frac{\|A - \tilde{A}\|}{\|A\|} \times 100$. 

For all the simulations, the field of view is chosen to be 90 degrees for a 500 x 500 pixel 
image size; a cloud of 20 points is randomly chosen with depths varying from 100 
to 400 units of focal length; the number of trials is always 100 and the number 
of image frames is 3 to 4 (depending on the minimum number of frames needed 
by each algorithm). The calibration matrix $A$ is simply the transformation from 
the original 2 x 2 (in unit of focal length) image to the 500 x 500 pixel image. 
For these parameters, the true $A$ should be 

$$A = \begin{pmatrix} 250 & 0 & 250 \\ 0 & 250 & 250 \\ 0 & 0 & 1 \end{pmatrix}.$$

The ratio of the magnitude of translation and rotation, or simply the T/R ratio, is compared 
at the center of the random cloud (scattered in the truncated pyramid specified 
by the given field of view and depth variation). 
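
As a small illustration of this setup, the sketch below derives the calibration matrix from the field of view and image size and evaluates a relative error in percent. The function names are hypothetical, the pinhole camera is assumed to have square pixels and a centred principal point, and the use of the Frobenius norm in the error measure is an assumption, since the norm is not specified in the text.

```python
import math
import numpy as np

def calibration_from_fov(fov_deg, width, height):
    """Pixel calibration matrix for square pixels, a centred principal point,
    and the given horizontal field of view."""
    f = (width / 2.0) / math.tan(math.radians(fov_deg) / 2.0)
    return np.array([[f, 0.0, width / 2.0],
                     [0.0, f, height / 2.0],
                     [0.0, 0.0, 1.0]])

def calibration_error(A_true, A_est):
    """Relative calibration error in percent (Frobenius norm assumed)."""
    return 100.0 * np.linalg.norm(A_true - A_est) / np.linalg.norm(A_true)

A = calibration_from_fov(90.0, 500, 500)
err = calibration_error(A, 1.05 * A)   # a uniform 5% overestimate of every entry
print(A)
print(err)
```

A 90 degree field of view over 500 pixels gives a focal length of 250 pixels, which reproduces the matrix stated in the text.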

Pure rotation case: For comparison, we here also implement the linear algorithm 
proposed by Hartley [?] for calibrating a purely rotating camera. Figures 
3 and 4 show the experiments performed in the pure rotation case. The axes 
of rotation are X and Y for Figures 3 and 4. The amount of rotation is 20°. 
The perfect data was corrupted with zero-mean Gaussian noise with standard 
deviation $\sigma$ varying from 0 to 5 pixels. In Figure 3 it can be observed that 
the algorithm performs very well in the presence of noise, reaching errors of less 
than 6% for a noise level of 5 pixels. Figure 4 shows the effect of the amount of 
translation. This experiment is aimed at testing the robustness of the pure rotation 
algorithm with respect to translation. The T/R ratio was varied from 0 to 0.5 
and the noise level was set to 2 pixels. It can be observed that the algorithm is 
not robust with respect to the amount of translation. 

Some possible ways of harnessing the constraints provided by chirality have been 
discussed in [?]. Basically they give inequality constraints on the possible solutions 
of the calibration. 




Translation parallel to rotation axis: Figures 5 and 6 show the experiments 
performed for our algorithm when translation is parallel to the axis of 
rotation. The non-isotropic normalization procedure proposed by Hartley [2] 
and statistically justified by Mühlich and Mester [?] was used to estimate the 
fundamental matrix. Figure 5 shows the effect of noise in the estimation of the 
calibration matrix for T/R = 1 and a rotation of $\theta$ = 20° between consecutive 
frames. It can be seen that the normalization procedure improves the estimation 
of the calibration matrix, but the improvement is not significant. This result is 
consistent with that of [?], since the effect of normalization is more important 
for large noise levels. On the other hand, the performance of the algorithm is 
not as good as that of the pure rotation case, but still an error of 5% is reached 
for a noise level of 2 pixels. Figure 6 shows the effect of the angle of rotation 
in the estimation of the calibration matrix for a noise level of 2 pixels. It can 
be concluded that a minimum angle of rotation between consecutive frames is 
required for the algorithm to succeed. 

Although in this paper we do not outline the algorithm, it should be clear from 
Section 3.2. 

For specifying the Rotation/Translation axes, we simply use symbols such as 
"XY-YY-ZZ" which means: for the first pair of images the relative motion is rotation 
along X and translation along Y; for the second pair both rotation and translation 
are along Y; and for the third pair both rotation and translation are along Z. 

Fig. 5. Rotation parallel to translation case. $\theta$ = 20°. Rotation/Translation 
axes: XX-YY-ZZ, T/R ratio = 1. 

Fig. 6. Rotation parallel to translation case. $\sigma$ = 2. Rotation/Translation 
axes: XX-YY-ZZ, T/R ratio = 1. 

Translation perpendicular to rotation axis: Figures 7 and 8 show the experiments 
performed for our algorithm when translation is perpendicular to the 
axis of rotation. It can be observed that this algorithm is much more sensitive 
to noise. The noise has to be less than 0.5 pixels in order to get an error of 5%. 
Experimentally it was found that Kruppa equations are very sensitive to the 
normalization of the fundamental matrix $F$ and that the eigenvalues $\lambda_1$ and $\lambda_2$ 
of $F\hat{p}'^T$ are close to each other. Therefore in the presence of noise, the estimation 
of those eigenvalues might be ill-conditioned (even complex eigenvalues are 
obtained) and so might the solution of Kruppa equations. Another experimental 
problem is that more than one non-degenerate solution to Kruppa equations can 
be found. This is because, when taking all possible combinations of eigenvalues of 
$F\hat{p}'^T$ in order to normalize $F$, the smallest eigenvalue of the linear map associated 
to "incorrect" Kruppa equations can be very small. Besides, the eigenvector 
associated to this eigenvalue can eventually give a non-degenerate matrix. Thus 
in the presence of noise, one cannot distinguish between the correct solution and one 
of these incorrect solutions. The results presented here correspond to the best 
match to the ground truth when more than one solution is found. Finally it is 
important to note that large motions can significantly improve the performance 
of the algorithm. Figure 8 shows the error in the estimation of the calibration 
matrix for a rotation of 30°. It can be observed that the results are comparable 
to those of the parallel case with a rotation of 20°. 

Fig. 7. Rotation orthogonal to translation case. $\theta$ = 20°. Rotation/Translation 
axes: XY-YZ-ZX, T/R ratio = 1. 

Fig. 8. Rotation orthogonal to translation case. $\theta$ = 30°. Rotation/Translation 
axes: XY-YZ-ZX, T/R ratio = 1. 

Robustness: We denote the angle between the rotation axis and translation by 
$\phi$. The two linear algorithms we have studied above are only supposed 
to work for the cases $\phi$ = 0° and $\phi$ = 90°. In order to check how robust these 
algorithms are, we run them anyway for cases when $\phi$ varies from 0° to 90°. 
The noise level is 2 pixels, the amount of rotation is always 20° and the T/R ratio 
is 1. Translation and rotation axes are given by Figure 9. Surprisingly, as we 
can see from the results given in Figure 10, for the range 0° ≤ $\phi$ ≤ 50°, both 
algorithms give pretty close estimates. Heuristically, this is because, for this range 
of angle, the eigenvalues of the matrix $F\hat{p}'^T$ are complex and numerically their 
norm is very close to the norm of the matrix $F\hat{p}'F^T$. Therefore, the computed 
renormalization scale $\lambda$ from both algorithms is very close, as is the calibration 
estimate. For $\phi$ > 50°, the eigenvalues of $F\hat{p}'^T$ become real and the performance 
of the two algorithms is no longer the same. Near the conditions under which 
these algorithms are designed to work, the algorithm for the perpendicular case 
is apparently more sensitive to the perturbation in the angle $\phi$ than the one for 
the parallel case: as is clear from the figure, a variation of 10° in $\phi$ results 
in an increase in error of almost 50%. We are currently conducting experiments on 
real images and trying to find ways to overcome this difficulty. 




Fig. 9. The relation of the three rotation axes $\omega_1, \omega_2, \omega_3$ and three 
translations $p_1, p_2, p_3$. 




Fig. 10. Estimation error in calibration w.r.t. different angle $\phi$. 



5 Conclusions 

In this paper, we have revisited the Kruppa equations based approach for camera 
self-calibration. Through a detailed study of the cases when the camera rotation 
axis is parallel or perpendicular to the translation, we have discovered generic 
difficulties in the conventional self-calibration schemes based on directly solving 
the nonlinear Kruppa equations. Our results not only complete existing results in 
the literature regarding the solutions of Kruppa equations but also provide brand 
new linear algorithms for self-calibration other than the well-known one for a 
purely rotating camera. Simulation results show that, under the given conditions, 
these linear algorithms provide good estimates of the camera calibration despite 
the degeneracy of the Kruppa equations. The performance is close to that of the 
pure rotation case. 
Abstract. We focus in this paper on the problem of adding computer-generated 
objects in video sequences that have been shot with a zoom 
lens camera. While numerous papers have been devoted to registration 
with fixed focal length, little attention has been paid to zoom lens 
cameras. In this paper, we propose an efficient two-stage algorithm for 
handling zoom changes, which are likely to happen in a video sequence. 
We first attempt to partition the video into camera motions 
and zoom variations. Then, classical registration methods are used on 
the image frames labeled camera motion while keeping the internal 
parameters constant, whereas the zoom parameters are only updated for 
the frames labeled zoom variations. Results are presented demonstrating 
registration on various sequences. Augmented video sequences are also 
shown. 



1 Introduction 

Augmented Reality (AR) is a technique in which the user’s view is enhanced 
or augmented with additional information generated from a computer model. In 
contrast to virtual reality, where the user is immersed in a completely computer- 
generated world, AR allows the user to interact with the real world in a natural 
way. This explains why interest in AR has substantially increased in the past 
few years and medical, manufacturing or urban planning applications have been 
developed [?]. 

In order to make AR systems effective, the computer generated objects and 
the real scene must be combined seamlessly so that the virtual objects align well 
with the real ones. It is therefore essential to determine accurately the location 
and the optical properties of the cameras. The registration task must be achieved 
with special care because the human visual system is very good at detecting even 
small mis-registrations. 

There has been much research in the field of vision-based registration for 
augmented reality [?]. However, these works assume that the internal 
parameters of the camera are known (focal length, aspect ratio, principal point) 
and they only address the problem of computing the pose of the camera. This is 
a strong limitation of these methods because zoom changes are likely to happen 
in a video sequence. A method is proposed in [?] which can retrieve metric 
reconstruction from image sequences obtained with uncalibrated zooming cameras. 
However, considering an unknown principal point leads to unstable results if 
the projective calibration is not accurate enough, the sequence not long enough, 
or the motion sequence critical towards the set of constraints. More stable results 
are obtained when the principal point is considered as fixed in the centre of the 
image, but this assumption is not always fulfilled (see [?]) and is not accurate 
enough for image composition. Other attempts have been made to cope with 
varying internal parameters for AR applications [?]. However, this approach 
uses targets arbitrarily positioned in the environment. It is therefore of limited 
use if outdoor scenes are considered. 

In this paper we extend our previous work on vision-based registration 
methods to the case of zoom-lens cameras. Zoom-lens camera calibration 
is still found to be very difficult for several reasons [?]: modeling a zoom-lens 
camera is difficult due to optical and mechanical misalignments in the lens 
system of a camera. Moreover, zoom-lens variations can be confused with camera 
motions: for instance, it is difficult to discriminate a translation along the optical 
axis from a zoom. 

In this paper, we take advantage of our application field to reduce the problem 
complexity. Indeed, we assume that the viewpoint and the focal length do not 
change at the same time. This assumption is compatible with the techniques used 
by professional movie-makers. We develop in this paper an original statistical 
approach: for each frame of the sequence, we test the hypothesis of a zoom against 
the hypothesis of a camera motion. If the motion hypothesis is retained, we still 
have to compute the camera pose with the old internal parameters. Otherwise, 
the internal parameters are computed assuming that the camera pose does not 
change. Camera parameters are assumed to be known in the first image of the 
sequence (they can be obtained easily from a set of at least 6 2D/3D point 
correspondences selected by the user). 

This paper is organized as follows: first, we discuss in section 2 the pinhole 
camera model and we show how difficult it is to recover both the camera pose 
and the internal parameters with varying focal lengths. Section 3 then describes 
our original method for zoom/motion partitioning of the sequence. Section 4 
describes how registration is performed from this segmentation. Examples which 
demonstrate the effectiveness of our method are shown in section 5. 



2 Registration Difficulties with a Zoom-Lens Camera 

In this section, we first describe the pinhole model which is widely used for 
camera modeling. Then we describe our attempts to compute both the zoom 
and the motion parameters in a single stage. This task is called full calibration 
in the following. We show that classical registration methods fail to recover 
both the internal and the external parameters, even when some of the intrinsic 
parameters are fixed. 

2.1 The Pinhole Camera Model 



Let $(X, Y, Z)$ represent the coordinates of any visible point $M$ in a fixed reference 
system (world coordinate system) and let $(X_c, Y_c, Z_c)$ represent the coordinates 
of the same point in the camera centered coordinate system. The relationship 
between the two coordinate systems is given by 

$$(X_c, Y_c, Z_c)^T = R\,(X, Y, Z)^T + T,$$

where $[R, T]$ is the 3D displacement (rotation and translation) from the world 
coordinate system to the camera coordinate system. 

We assume that the camera performs a perfect perspective transform with 
center $O$ at a distance $f$ from the image plane. The projection of $M$ on the image 
plane is $(x = f X_c / Z_c,\ y = f Y_c / Z_c)$. If $1/k_u$ (resp. $1/k_v$) is the size of the pixel along 
the $x$ axis (resp. $y$ axis), its pixel coordinates are: 

$$m = \left(k_u f \frac{X_c}{Z_c} + u_0,\ k_v f \frac{Y_c}{Z_c} + v_0\right) \qquad (1)$$

where $u_0, v_0$ are the coordinates of the principal point of the camera (i.e. the 
intersection of the optical axis and the image plane). 
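
As a minimal sketch of equation (1), assuming the rotation is given as a 3x3 nested list and writing `alpha_u` for $k_u f$ and `alpha_v` for $k_v f$, the projection of a world point can be computed as follows. The function names `world_to_camera` and `project` are illustrative and do not come from the paper.

```python
from typing import Sequence, Tuple


def world_to_camera(R: Sequence[Sequence[float]],
                    T: Sequence[float],
                    M: Sequence[float]) -> Tuple[float, float, float]:
    """Apply the displacement [R, T]: M_c = R * M + T."""
    Xc, Yc, Zc = (sum(R[i][j] * M[j] for j in range(3)) + T[i] for i in range(3))
    return (Xc, Yc, Zc)


def project(R, T, alpha_u: float, alpha_v: float, u0: float, v0: float,
            M: Sequence[float]) -> Tuple[float, float]:
    """Pixel coordinates of the world point M, following equation (1)."""
    Xc, Yc, Zc = world_to_camera(R, T, M)
    if Zc <= 0:
        raise ValueError("point is not in front of the camera")
    return (alpha_u * Xc / Zc + u0, alpha_v * Yc / Zc + v0)


# Hypothetical camera: identity rotation, no translation.
IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
pixel = project(IDENTITY, [0.0, 0.0, 0.0], 800.0, 800.0, 320.0, 240.0,
                [1.0, 2.0, 4.0])
```

The numerical values (800 pixels for the focal terms, a 640x480 image centre) are arbitrary and only serve to exercise the formula.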



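The coordinates of a 3D point $M$ in the world coordinate system and its pixel 
coordinates $m = (u, v)$ are therefore related, in homogeneous coordinates, by 

$$s \begin{pmatrix} u \\ v \\ 1 \end{pmatrix} = A\,[R\ T] \begin{pmatrix} X \\ Y \\ Z \\ 1 \end{pmatrix}, \qquad A = \begin{pmatrix} \alpha_u & 0 & u_0 \\ 0 & \alpha_v & v_0 \\ 0 & 0 & 1 \end{pmatrix}.$$

Full camera calibration amounts to computing 10 parameters: 6 external 
parameters (3 for the rotation and 3 for the translation) and 4 internal parameters 
($\alpha_u = k_u f$, $\alpha_v = k_v f$, $u_0$ and $v_0$). Internal and external parameters are 
collectively referred to as camera parameters in the following. 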



2.2 Direct Full Calibration 

When the internal parameters are computed off-line, the registration process 
amounts to computing the displacement $[R, T]$ which minimizes the re-projection 
error, that is, the error between the projection of known 3D features in the scene 
and their corresponding 2D features detected in the image. For the sake of clarity, 
we suppose here that the 3D features are points, but we can also consider free 
form curves. Moreover, we show in section 4.1 that 2D/2D correspondences 
can be added to improve the viewpoint computation. 

The camera pose is therefore the displacement $[R, T]$ which minimizes the 
reprojection error 

$$\min_{R,T} \sum_i \mathrm{dist}(\mathrm{proj}(M_i), m_i)^2$$

where minimization is performed only on the 6 external parameters (Euler angles 
and translation). 

Theoretically, zoom-lens variations during shooting can be recovered in the 
same way. We therefore have to compute not only the camera viewpoint but also 
the internal camera parameters (focal length, pixel size, optical center) which 
minimize the reprojection error: 

$$\min_{R,T,\alpha_u,\alpha_v,u_0,v_0} \sum_i \mathrm{dist}(\mathrm{proj}(M_i), m_i)^2$$

As mentioned by several authors, this approach is unable to recover both 
the internal and external parameters. To overcome this problem, some authors 
have proposed to reduce the number of unknowns by fixing some of the internal 
parameters to predefined values. As several experimental studies have shown that 
the ratio $\alpha_u/\alpha_v$ remains almost constant during zoom variations, the set of 
internal parameters to be estimated is then reduced to $\alpha_u$, $u_0$, $v_0$. Unfortunately, 
this approach fails to recover the right camera parameters. Consider for instance 
Fig. 1, which shows the results when registration is performed on the 6 external 
parameters and the 3 internal parameters. As the house stands on a calibration 
target, the internal and external parameters can be computed for each frame 
using classical calibration techniques. They can therefore be compared to those 
computed with the registration method. The camera motions with respect to the 
turntable and the zoom variations during the cottage sequence are given in Table 
3a. The camera trajectory and the focal length computed for each frame 
of the sequence are shown in Fig. 1 in dashed lines. They have to be compared 
to the actual parameters, which are shown in solid lines on the same figure. Note 
that the trajectory is the position of the camera in the horizontal plane and the 
arrows indicate the optical axis. These results show that some camera motions 
are confused with zoom variations: besides the common confusion between zoom 
and translation along the optical axis, other computed motions do not correspond to the 
actual ones: between frames 13 and 14, an unexpected translation is detected 
and is compensated by a camera zoom out. 
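
The confusion between a zoom and a translation along the optical axis can be reproduced on synthetic data. The sketch below is not the authors' experiment: it assumes a simplified camera with identity rotation, equal focal terms and a translation `tz` along the optical axis only, and all numerical values are hypothetical. For coplanar points facing the camera, halving the focal term and moving the camera closer gives exactly the same reprojection error as the true parameters; the ambiguity only disappears once a point at another depth is added.

```python
def project(alpha, u0, v0, tz, M):
    """Pinhole projection with R = I, T = (0, 0, tz), alpha_u = alpha_v = alpha."""
    X, Y, Z = M
    Zc = Z + tz
    return (alpha * X / Zc + u0, alpha * Y / Zc + v0)


def reprojection_error(alpha, u0, v0, tz, points_3d, points_2d):
    """Sum of squared distances between projected and observed points."""
    total = 0.0
    for M, (u, v) in zip(points_3d, points_2d):
        pu, pv = project(alpha, u0, v0, tz, M)
        total += (pu - u) ** 2 + (pv - v) ** 2
    return total


# Four coplanar points at depth Z = 4, observed with the "true" camera.
plane = [(-1.0, -1.0, 4.0), (1.0, -1.0, 4.0), (1.0, 1.0, 4.0), (-1.0, 1.0, 4.0)]
observed = [project(800.0, 320.0, 240.0, 0.0, M) for M in plane]

err_true = reprojection_error(800.0, 320.0, 240.0, 0.0, plane, observed)
# Alternative explanation: shorter focal length, camera moved 2 units closer.
err_alt = reprojection_error(400.0, 320.0, 240.0, -2.0, plane, observed)

# Adding one point at a different depth breaks the ambiguity.
scene = plane + [(0.5, 0.5, 5.0)]
observed_scene = [project(800.0, 320.0, 240.0, 0.0, M) for M in scene]
err_alt_depth = reprojection_error(400.0, 320.0, 240.0, -2.0, scene, observed_scene)
```

With little depth variation in the scene and noisy image measurements, the two explanations remain close, which is consistent with the confusions reported above.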

Such confusions are also observed by Bougnoux, who considers that they 
do not really affect the quality of the reconstruction of the scene. Unfortunately, 
the conclusion is not the same for the quality of a composition: an augmented 
sequence of the cottage using the computed viewpoints and focal length is shown 
on our web site. Small errors on the camera parameters do not really affect the 
reprojection of the scene but they induce jittering effects which affect the realism 
of the composition. 

To take into account the interdependence of the internal parameters, Sturm 
expresses $u_0$ and $v_0$ as polynomial functions of $\alpha_u$. As the aspect ratio 
$\alpha_u/\alpha_v$ remains constant over the sequence, only one internal parameter has 
to be determined. However, to determine the degrees and the coefficients of 
the polynomial models, the camera has to be pre-calibrated for several zoom 
positions. 

Hence, resolving the general full calibration problem is difficult. In this paper, 
we propose a robust solution to the particular case of sequences where camera 



582 G. Simon and M.-O. Berger 




Fig. 1. (a) A snapshot of the cottage sequence and the reprojection of the 3D features, 
(b) The actual camera trajectory (solid line) and the computed one (dashed line), (c) 
The actual (solid line) and the estimated (dashed line) focal length during the sequence. 



pose and zoom do not change at the same time. This particular case is very 
interesting for practical applications: indeed, when professional movie-makers 
shoot, they generally avoid mixing camera motions and zoom variations. To 
take advantage of the structure of these sequences, we compute the reprojection 
error for each frame of the sequence in the two possible cases, zoom alone and 
camera motion alone: (i) we consider that the internal parameters do not change 
and we search for the camera pose $[R, T]$ that minimizes the reprojection error; 
(ii) we consider that the camera is fixed and we search for the internal 
parameters. Surprisingly, the experiments we conducted show that the smaller of these 
two residuals does not always correspond to the right hypothesis: Fig. 2 plots 
the reprojection error for frames 22 to 35 of a camera zoom sequence. For 
each frame i, the reprojection error between frame 20 and frame i is computed 
for the zoom and the motion hypotheses. This allows us to see the influence of 
the zoom magnitude on the criterion. The results show that this method fails to 
recover the right camera parameters unless the magnitude of the zoom variation 
is high. 




Fig. 2. Reprojection error with the zoom and the motion assumption for a camera 
zoom motion. 




Registration with a Moving Zoom Lens Camera 583 



3 Discriminating between Zoom Variation and Camera 
Motion 

The above results show that the classical registration methods cannot be used to 
cope with zoom-lens cameras. We therefore resort to a two-stage method: we first 
attempt to partition the video into camera motions and zoom variations. Then, 
our registration method is used on the image frames labeled camera motion while 
keeping the internal parameters constant, whereas the internal parameters are 
only computed for the frames labeled zoom variations. Unlike other methods for 
video partitioning, which are based on the analysis of the optic flow, our 
method is only based on the analysis of a set of 2D corresponding points which 
are automatically extracted and matched between two consecutive images. The 
motion information brought by the key-points is very reliable and allows us to 
discriminate easily between zoom variation and translation along the optical axis. 
Our approach differs from the optic flow based approach in several respects: in the 
latter, the mean and the standard deviation of the optical flow are computed in seven 
non-overlapping sub-regions of the image. These values are compared with thresholds to 
discriminate between zoom, tilt, pan, Z-rotation, horizontal translation, vertical 
translation and Z-translation. However, it is not explained how the thresholds 
are computed, although this is the main point of the algorithm (furthermore, many 
confusions are observed in the final results). Moreover, to discriminate between 
a zoom and a Z-translation, the authors suppose that the center of the zoom is 
the center of the image, which is not true in practical situations. 

Section 3.1 describes the way to extract key-points. Then we present the 
affine model of a zoom (section 3.2). Finally we give our algorithm for 
zoom/motion automatic segmentation of the sequence (section 3.3). 



3.1 Extracting and Matching Key-Points 



Key-points (or interest points) are locations in the image where the signal 
changes two dimensionally: corners, T-junctions or locations where the texture 
varies significantly. We use the approach developed by Harris and Stephens: 
they exploit the autocorrelation function of the image to compute a measure 
which indicates the presence of an interest point. More precisely, the eigenvalues 
of the matrix 

$$\begin{pmatrix} I_x^2 & I_x I_y \\ I_x I_y & I_y^2 \end{pmatrix}$$

are the principal curvatures of the auto-correlation function. If these values are 
high, a key-point is declared. 
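
The eigenvalue test can be sketched as follows. This is a simplified illustration, not the detector used in the paper: it assumes the image patch is a nested list of gray levels, sums the gradient products over the patch without any Gaussian weighting, and uses a hypothetical `threshold` on the smaller eigenvalue. Harris and Stephens themselves use a response built from the determinant and trace of the matrix rather than explicit eigenvalues.

```python
import math


def structure_matrix(patch):
    """Sum I_x^2, I_x*I_y and I_y^2 over the interior pixels of the patch,
    using central differences for the image derivatives."""
    h, w = len(patch), len(patch[0])
    sxx = sxy = syy = 0.0
    for r in range(1, h - 1):
        for c in range(1, w - 1):
            ix = (patch[r][c + 1] - patch[r][c - 1]) / 2.0
            iy = (patch[r + 1][c] - patch[r - 1][c]) / 2.0
            sxx += ix * ix
            sxy += ix * iy
            syy += iy * iy
    return sxx, sxy, syy


def eigenvalues(sxx, sxy, syy):
    """Eigenvalues (largest first) of the symmetric matrix [[sxx, sxy], [sxy, syy]]."""
    mean = (sxx + syy) / 2.0
    dev = math.sqrt(((sxx - syy) / 2.0) ** 2 + sxy ** 2)
    return mean + dev, mean - dev


def is_keypoint(patch, threshold):
    """Declare a key-point when both eigenvalues exceed the threshold."""
    _, lam_min = eigenvalues(*structure_matrix(patch))
    return lam_min > threshold


# Synthetic 5x5 patches: uniform area, vertical edge, and corner.
flat = [[10.0] * 5 for _ in range(5)]
edge = [[0.0, 0.0, 10.0, 10.0, 10.0] for _ in range(5)]
corner = [[10.0 if (r >= 2 and c >= 2) else 0.0 for c in range(5)] for r in range(5)]
```

On an edge only one eigenvalue is large, so the patch is rejected; on a corner both are large, which is the two-dimensional signal change described above.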

We still have to match these key-points between two consecutive images. To 
do this, we use correlation techniques. 

Fig. 3a and 3b show the key-points which have been automatically extracted 
in two successive images of the Loria scene and Fig. 3c shows the matched 
key-points. 

Fig. 3. (a,b) Key-points extracted in two consecutive frames, (c) The matched 
key-points. 



3.2 Modeling Zoom-Lens Cameras 

Previous studies on zoom-lens modeling have shown that the ratio $\alpha_u/\alpha_v$ is very stable 
over long time periods. On the contrary, the position of the principal point 
$(u_0, v_0)$ depends on the zooming position of the camera. This point can vary by up 
to 100 pixels while zooming. However, for most camera lenses, it can be shown that 
the principal point moves along a line while zooming. That is the reason why an 
affine model with 3 parameters $C_0, a_0, b_0$ can be used to describe zoom variations. 
Enciso and Vieville show that if $(u', v')$ and $(u, v)$ are corresponding points 
after zooming, we have 

$$\begin{cases} u' = C_0 u + a_0, \\ v' = C_0 v + b_0. \end{cases}$$

The current matrix $A'$ of the internal parameters is therefore deduced from 
the previous one $A$ by: 

$$A' = \begin{pmatrix} C_0 & 0 & a_0 \\ 0 & C_0 & b_0 \\ 0 & 0 & 1 \end{pmatrix} A. \qquad (3)$$
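
Equation (3) can be checked with a few lines of code. The sketch below assumes the matrix of internal parameters has the usual upper triangular form with the focal terms on the diagonal and the principal point in the last column; the helper names and the numerical values are hypothetical.

```python
def matmul3(P, Q):
    """Product of two 3x3 matrices stored as nested lists."""
    return [[sum(P[i][k] * Q[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def zoom_matrix(c0, a0, b0):
    """Affine zoom model: u' = c0*u + a0, v' = c0*v + b0."""
    return [[c0, 0.0, a0], [0.0, c0, b0], [0.0, 0.0, 1.0]]


def apply_zoom(A, c0, a0, b0):
    """Internal parameters after the zoom, as in equation (3)."""
    return matmul3(zoom_matrix(c0, a0, b0), A)


# Hypothetical intrinsics: alpha_u = 800, alpha_v = 820, principal point (320, 240).
A = [[800.0, 0.0, 320.0], [0.0, 820.0, 240.0], [0.0, 0.0, 1.0]]
A_zoomed = apply_zoom(A, 1.25, -80.0, -60.0)
```

Both focal terms are multiplied by $C_0$, so their ratio is preserved, while the principal point becomes $(C_0 u_0 + a_0,\ C_0 v_0 + b_0)$. With the values chosen here the principal point is left unchanged at (320, 240).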

If we want to use this property to discriminate between a zoom and a camera 
motion, we must show that a camera motion cannot be approximated by the 
same model. This can be shown from the equations of the optical flow: the 
optical flow (or instantaneous velocity) of an image point $(x = f X_c/Z_c,\ y = f Y_c/Z_c)$ 
is 

$$\begin{cases} \dot{x} = \dfrac{-U + xW}{Z_c} + Axy - B(x^2 + 1) + Cy, \\ \dot{y} = \dfrac{-V + yW}{Z_c} + A(y^2 + 1) - Bxy - Cx, \end{cases}$$

where $(U, V, W)$ is the translational component of the motion of the camera, 
$(A, B, C)$ is its angular velocity and $f$ is set to 1. The optical flow obtained 
for the basic motions $T_x$ (horizontal translation), $T_y$ (vertical translation), $T_z$ 
(Z-translation), $R_x$ (tilt), $R_y$ (pan) and $R_z$ (Z-rotation) is given in Table 1a. 
Theoretically, none of these motions can be described by an affine transformation 
with three parameters. However, if $Z_c = Z_0 + \Delta Z$ where $\Delta Z \ll Z_0$ for each model 
point, that is, the depth of the object is small with regard to the distance from 
the object to the camera (case 1), then $T_x$, $T_y$ and $T_z$ can be approximated by 
a zoom model whose parameters $C_0$, $a_0$ and $b_0$ are given in Table 1b (we use 
the approximation $x = X_c/Z_c \approx X_c/Z_0$ and $y = Y_c/Z_c \approx Y_c/Z_0$). Moreover, if $x \ll 1$ and 
$y \ll 1$, that is, the focal length is large (case 2), then $R_x$ and $R_y$ can also be 
approximated by a zoom model (see Table 1b). 

Hence, some camera motions can induce an image motion close to the model 
of the zoom. Fortunately, most of them can easily be identified as camera 
motions. Indeed, for a zoom motion, the invariant point of the affine model, 
$\left(\frac{a_0}{1 - C_0}, \frac{b_0}{1 - C_0}\right)$, is the principal point of the camera and lies approximately in the 
middle of the image. On the contrary, for $T_x$, $T_y$, $R_x$ and $R_y$, this point is 
outside the image and goes to infinity because $C_0$ is close to 1. Finally, only the 
translation $T_z$ along the optical axis is really difficult to discriminate from a 
zoom. 

Table 1. (a) Optical flow obtained for the basic motions, (b) Parameters of the 
approximating affine model for ambiguous cases. 



3.3 Zoom/Motion Partitioning 

In this section, we present our approach for zoom/motion partitioning. For each 
frame of the sequence, we test the hypothesis of a zoom against the 
hypothesis of a camera motion. We proceed as follows: key-points $(u_i, v_i)_{1 \le i \le N}$ and 
$(u'_i, v'_i)_{1 \le i \le N}$ are extracted and matched in two consecutive frames $I_k$ and $I_{k+1}$. 
If we suppose that a zoom occurs, the model parameters $C_0, a_0, b_0$ which best fit 
the set of corresponding key-points are computed by minimizing the residual 

$$r = \frac{1}{N} \sum_{i=1}^{N} (u'_i - C_0 u_i - a_0)^2 + (v'_i - C_0 v_i - b_0)^2. \qquad (4)$$

We must now estimate the goodness of fit of the data to the affine model of 
the zoom. We have to test whether the discrepancy $r$ is compatible with the noise 
magnitude on the extracted key-points. Otherwise the zoom hypothesis should 
be questioned. 
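
Because the zoom model is linear in $C_0$, $a_0$ and $b_0$, the minimization of the residual has a closed form. The sketch below assumes the residual is the mean of the squared distances between matched and predicted points; the exact normalization used in the paper may differ, and the function name `fit_zoom_model` is illustrative.

```python
def fit_zoom_model(points, matches):
    """Least-squares fit of u' = c0*u + a0, v' = c0*v + b0.

    points  : list of (u, v) key-points in the first frame
    matches : list of (u', v') matched key-points in the next frame
    Returns (c0, a0, b0, r), where r is the mean squared residual.
    """
    n = len(points)
    if n < 2 or n != len(matches):
        raise ValueError("need at least two matched key-points")
    mu = sum(p[0] for p in points) / n
    mv = sum(p[1] for p in points) / n
    mu2 = sum(q[0] for q in matches) / n
    mv2 = sum(q[1] for q in matches) / n
    num = den = 0.0
    for (u, v), (u2, v2) in zip(points, matches):
        du, dv = u - mu, v - mv
        num += du * (u2 - mu2) + dv * (v2 - mv2)
        den += du * du + dv * dv
    if den == 0.0:
        raise ValueError("key-points are all identical")
    c0 = num / den
    # The optimal offsets map the centroid of the points onto that of the matches.
    a0 = mu2 - c0 * mu
    b0 = mv2 - c0 * mv
    r = sum((u2 - c0 * u - a0) ** 2 + (v2 - c0 * v - b0) ** 2
            for (u, v), (u2, v2) in zip(points, matches)) / n
    return c0, a0, b0, r


# Synthetic matches generated with c0 = 1.1, a0 = -32, b0 = -24.
pts = [(0.0, 0.0), (100.0, 0.0), (0.0, 100.0), (100.0, 100.0), (50.0, 20.0)]
matched = [(1.1 * u - 32.0, 1.1 * v - 24.0) for u, v in pts]
c0, a0, b0, r = fit_zoom_model(pts, matched)
```

On noise-free synthetic matches the residual vanishes; on real key-points its value reflects both the localization noise and any departure from the zoom model.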

Statistical tests, such as $\chi^2$ tests, are often used to estimate the compatibility 
of the data with the model at a given significance level $\alpha$ (90% for instance). 
However, the standard deviation is needed for each datum. In our case, it is very 
difficult to calculate an error on the location of the key-points. The $\chi^2$ test also 
has a serious drawback: how can we set the significance level $\alpha$? For a very large 
value of $\alpha$, the hypothesis is always accepted, while for a very small value of $\alpha$ 
the hypothesis is always rejected. 

That is the reason why we resort to another criterion to assess the zoom 
hypothesis. An important thing to note is that a zoom variation does not introduce 
new features in the images whereas a translation motion does: some features which 
are visible from a camera viewpoint are no longer visible from a neighboring camera 
position. In Fig. 4a, point A is not visible from $C_k$ because it is occluded by 
the object $O_1$. But point A becomes visible when the camera moves from $C_k$ to 
$C_{k+1}$. Note that such a phenomenon also arises for translation along the optical 
axis (Fig. 4b). These features which become visible due to the camera motion 
are very important for assessing the zoom hypothesis. As key-points are not 
necessarily detected in the areas which become visible or which disappear, the 
key-points are not well suited for zoom assessment. We therefore use the set of 
all the contours detected in image $I_k$ to assess the parameters (if $C_0 < 1$ we use 
image $I_{k+1}$). We first compute a correlation score for each contour. This score 
belongs to $[-1, 1]$ and is higher when the zoom hypothesis is better fulfilled. If the 
zoom hypothesis is satisfied, the gray levels $I_k(u, v)$ and $I_{k+1}(C_0 u + a_0, C_0 v + b_0)$ 
must be nearly the same. Moreover, the neighborhoods of these two corresponding 
points must be similar. We therefore use the correlation score to evaluate the 
zoom hypothesis. First, we define the correlation for a given point $m = (u, v)$ in 
$I_k$: 

$$\mathrm{score}(m) = \frac{\sum_{i=-n}^{n} \sum_{j=-n}^{n} I_k(u+i, v+j) \times I_{k+1}(C_0(u+i) + a_0,\ C_0(v+j) + b_0)}{(2n+1)^2\, \sigma(I_k)\, \sigma(I_{k+1})}$$

where $\sigma(I_k)$ (resp. $\sigma(I_{k+1})$) is the standard deviation of $I_k$ (resp. $I_{k+1}$) 
in the neighborhood $(2n+1) \times (2n+1)$ of $(u, v)$ (resp. $(C_0 u + a_0, C_0 v + b_0)$). 
The score ranges from $-1$ for two correlation windows which are not similar at 
all, to 1 for two correlation windows which are identical. 
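
A possible implementation of this score is sketched below. Two choices are assumptions rather than details taken from the paper: the gray levels of each window are centered on the window mean before being multiplied, which is what makes the score fall in $[-1, 1]$, and the zoomed position is sampled at the nearest pixel instead of being interpolated. Images are nested lists indexed as `img[row][column]`.

```python
import math


def pixel(img, x, y):
    """Gray level at column x, row y; reject positions outside the image."""
    if not (0 <= y < len(img) and 0 <= x < len(img[0])):
        raise IndexError("correlation window leaves the image")
    return img[y][x]


def correlation_score(img_k, img_k1, u, v, c0, a0, b0, n):
    """Zero-mean normalized correlation between the (2n+1)x(2n+1) window
    around (u, v) in img_k and the window predicted by the zoom model in img_k1."""
    w1, w2 = [], []
    for j in range(-n, n + 1):
        for i in range(-n, n + 1):
            w1.append(pixel(img_k, u + i, v + j))
            x = int(round(c0 * (u + i) + a0))   # nearest-pixel sampling (assumption)
            y = int(round(c0 * (v + j) + b0))
            w2.append(pixel(img_k1, x, y))
    size = len(w1)
    m1, m2 = sum(w1) / size, sum(w2) / size
    s1 = math.sqrt(sum((p - m1) ** 2 for p in w1) / size)
    s2 = math.sqrt(sum((p - m2) ** 2 for p in w2) / size)
    if s1 == 0.0 or s2 == 0.0:
        return 0.0  # uniform window: correlation is undefined, treat as neutral
    cov = sum((p - m1) * (q - m2) for p, q in zip(w1, w2)) / size
    return cov / (s1 * s2)


def texture(x, y):
    """Synthetic gray level used to build small test images."""
    return float((x * 7 + y * 13) % 50)


img_a = [[texture(x, y) for x in range(20)] for y in range(20)]
# Same scene displaced by (3, 2) pixels: zoom model with c0 = 1, a0 = 3, b0 = 2.
img_shift = [[texture(x - 3, y - 2) for x in range(20)] for y in range(20)]
score_ok = correlation_score(img_a, img_shift, 10, 10, 1.0, 3.0, 2.0, 2)
```

When the affine parameters describe the image transformation correctly, the two windows coincide and the score is 1; inverting the contrast of the second image drives it to -1.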

If a contour $C$ is given by the points $m_1, \ldots, m_p$, the score of the contour is 
defined as the average of the scores of all its points: 

$$\mathrm{score}(C) = \frac{1}{p} \sum_{i=1}^{p} \mathrm{score}(m_i).$$

Finally, the score of the zoom hypothesis is computed as the minimum of the 
scores of the contours (note that only the strong contours are kept). This is a 
robust way to assess the zoom hypothesis. Indeed, if a zoom variation really 
happens, the score is high for each contour, and the global score is high too. On 
the contrary, if a camera motion happens, the score is generally low for nearly 
all the contours because the affine zoom model does not 
match the image transformation. Moreover, in the case of a translating motion, the 
score is low for the contours of $I_k$ which are occluded in $I_{k+1}$. Hence the global 
score is low too. 
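
The aggregation of point scores into a contour score and then into a global score is short enough to write out. The sketch assumes the point scores have already been computed and are grouped per contour; the values below are made up for illustration.

```python
def contour_score(point_scores):
    """Average of the correlation scores of the points of one contour."""
    if not point_scores:
        raise ValueError("a contour needs at least one point")
    return sum(point_scores) / len(point_scores)


def global_score(contours):
    """Score of the zoom hypothesis: the minimum over all contour scores."""
    if not contours:
        raise ValueError("no contour to assess")
    return min(contour_score(scores) for scores in contours)


# Hypothetical scores: two well explained contours and one poorly explained one.
contours = [[0.9, 0.8], [0.7, 0.9], [0.1, 0.3]]
score = global_score(contours)
```

Taking the minimum rather than the mean means that a single contour that the zoom model fails to explain, for instance one that becomes occluded, is enough to pull the global score down.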

We still have to choose a threshold $Th_{score}$ which allows us to distinguish 
between zoom variation and camera motion according to the global score. This 

Fig. 4. New features appear under translating motion: point A is not visible from $C_k$ 
but becomes visible from $C_{k+1}$. 



value has been determined experimentally on various sequences. Experiments 
we have conducted (see section 5) show that the value $Th_{score} = 0.5$ can be 
used for all the considered sequences to discriminate between zoom variation 
and camera motion, even for the difficult case of a translation along the optical 
axis. Hence, if the global score is greater than 0.5 and if the invariant point of the affine model lies 
inside the image, then the zoom hypothesis is accepted; otherwise the camera 
motion hypothesis is retained. 
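
The resulting decision rule combines the two tests. In the sketch below the function names, the string labels and the image size are hypothetical; the threshold 0.5 and the invariant point $(a_0/(1 - C_0),\ b_0/(1 - C_0))$ are those given in the text.

```python
def invariant_point(c0, a0, b0):
    """Fixed point of u' = c0*u + a0, v' = c0*v + b0, or None if c0 == 1."""
    if c0 == 1.0:
        return None  # pure image translation: the fixed point is at infinity
    return (a0 / (1.0 - c0), b0 / (1.0 - c0))


def classify_frame(global_score, c0, a0, b0, width, height, threshold=0.5):
    """Label a frame 'zoom' or 'motion' from the global score and the
    position of the invariant point of the fitted affine model."""
    point = invariant_point(c0, a0, b0)
    inside = (point is not None
              and 0.0 <= point[0] < width
              and 0.0 <= point[1] < height)
    return "zoom" if (global_score > threshold and inside) else "motion"


# Invariant point (320, 240) inside a 640x480 image, high score: zoom.
label_zoom = classify_frame(0.75, 1.25, -80.0, -60.0, 640, 480)
# High score but c0 close to 1: the invariant point is far outside the image.
label_pan = classify_frame(0.75, 1.001, 5.0, 0.0, 640, 480)
# Invariant point inside the image but low score: motion.
label_low = classify_frame(0.3, 1.25, -80.0, -60.0, 640, 480)
```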



4 Registration with a Zoom Lens Camera 

Once the zoom/motion partitioning has been achieved, registration can be 
performed. If the frame belongs to a camera zoom sequence, then registration is 
performed only on the set of the internal parameters. Otherwise, registration is 
performed only on the set of the external parameters. We 
use n 2D/3D curve correspondences. Once the curves corresponding to the 3D 
features have been detected in the first frame of the sequence, they are tracked 
from frame to frame. 



4.1 Registration for a Camera Motion 



If the frame belongs to a camera motion sequence, we perform a six-parameter 
optimization from the curve correspondences: the internal parameters keep the 
values they had in frame $k$, and the pose $[R_{k+1}, T_{k+1}]$ is obtained as the argmin 
over $(R, T)$ of a criterion built from the residuals $r_i$, 
where $r_i$ is a robust distance between 2D curve $i$ and the projection of its 
3D counterpart. The computation of the residual $r_i$ is detailed elsewhere. However, 
one of the limitations of using 2D/3D correspondences originates in the spatial 
distribution of the model features: the reprojection error is likely to be large far 
from the 3D features used for the viewpoint computation. An example is shown 
in Fig. 5a: the viewpoint has been computed using the building in the background 
of the scene (the Opera). If we add a computer generated car in the foreground 
of the scene, this car seems to hover. 




Fig. 5. (a) Registration using only 2D/3D correspondences, (b) Registration with the 
mixing method. 



In order to improve viewpoint computation, we propose to use the key-points 
that have been matched for the partitioning stage. Previous approaches 
attempted to recover the viewpoint from 2D/2D correspondences alone; 
unfortunately, this approach turns out to be very sensitive to noise in image 
measurements. For this reason, point correspondences between frames are used here 
to provide additional constraints on the viewpoint computation. 

Our approach combines the strengths of these two methods: the viewpoint 
is defined as the minimum of a cost function which incorporates 2D/3D 
correspondences between the image and the model as well as 2D/2D correspondences 
of key-points. Note that the extracted key-points bring information in areas 
where the 3D knowledge available on the scene is missing (Fig. 5b). 

Given the viewpoint $[R_k, T_k]$ computed for a given frame $k$, we now explain 
how we compute the viewpoint in the next frame $k+1$ using the 3D model as 
well as the matched key-points $(q_k^i, q_{k+1}^i)_{1 \le i \le N}$. Let $q_k^i$ be a point in frame $k$. 
Its corresponding point in frame $k+1$ belongs to the intersection of the image 
plane with the plane $(C_k, C_{k+1}, q_k^i)$. This line is called the epipolar line. For two 
matched points $(q_k^i, q_{k+1}^i)$, the quality of the computed viewpoint can be assessed 
by measuring the distance $v_i$ between $q_{k+1}^i$ and the epipolar line of $q_k^i$ in frame 
$k+1$. Then, a simple way to improve the viewpoint computation using the 
interest points is to minimize a cost function that combines the 2D/3D residuals 
$r_i$ with the epipolar distances $v_i$, the latter being weighted by a parameter $\lambda$. 
This way, any a priori information about the scene where the virtual object 
is going to sit can be included in this model. The $\lambda$ parameter controls the 
compromise between the closeness to the available 3D data and the quality of 
the 2D correspondences between the key-points. We use $\lambda = 1$ in our practical 
experiments. The minimum of this cost function is computed by using an iterative 
minimization algorithm such as Powell's algorithm, initialization being 
obtained from the parameters computed in the previous image of the sequence. 
More details about this method can be found in m- 
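
The distance between a matched key-point and the epipolar line of its counterpart can be sketched as follows. The paper derives the epipolar geometry from the two camera poses; here it is assumed, for illustration only, that this geometry is available as a 3x3 fundamental matrix `F` such that the epipolar line of a point of frame k, in frame k+1, is `F` times that point in homogeneous coordinates.

```python
import math


def epipolar_line(F, q):
    """Coefficients (a, b, c) of the line a*x + b*y + c = 0 in frame k+1
    associated with the point q = (x, y) of frame k."""
    x, y = q
    return tuple(F[r][0] * x + F[r][1] * y + F[r][2] for r in range(3))


def epipolar_distance(F, q_k, q_k1):
    """Distance between q_k1 and the epipolar line of q_k in frame k+1."""
    a, b, c = epipolar_line(F, q_k)
    norm = math.hypot(a, b)
    if norm == 0.0:
        raise ValueError("degenerate epipolar line")
    return abs(a * q_k1[0] + b * q_k1[1] + c) / norm


# Fundamental matrix of a camera translating along its x axis:
# epipolar lines are the horizontal lines y' = y.
F_horizontal = [[0.0, 0.0, 0.0], [0.0, 0.0, -1.0], [0.0, 1.0, 0.0]]
v = epipolar_distance(F_horizontal, (10.0, 20.0), (15.0, 23.0))
```

In this configuration a correct match stays on the same image row, so the distance is simply the vertical offset between the two points.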



4.2 Registration for a Zoom 



If the frame belongs to a camera zoom sequence, we get the new intrinsic 
parameters of the camera from equation (3). However, as approximation errors can 
propagate from frame to frame, we prefer to perform a three-parameter 
optimization from the 2D/3D correspondences. Hence, the camera parameters in 
frame $k+1$ are deduced from the camera parameters in frame $k$ as follows: the pose 
is kept unchanged, $[R_{k+1}, T_{k+1}] = [R_k, T_k]$, the three parameters 
$[C_0^{k+1}, u_0^{k+1}, v_0^{k+1}]$ are obtained as the argmin over $(C_0, u_0, v_0)$ of the 
criterion built from the residuals $r_i$, and 

$$\alpha_u^{k+1} = C_0^{k+1} \alpha_u^k, \qquad \alpha_v^{k+1} = C_0^{k+1} \alpha_v^k.$$



5 Experimental Results 

In this section, we first justify experimentally the use of the threshold $Th_{score} = 
0.5$ to discriminate between zoom variations and camera motions. Then, section 
5.2 presents results of the partitioning process. Finally, registration results are 
given and augmented scenes are shown. 



5.1 Choosing Thscore 

To show that $Th_{score} = 0.5$ is well suited to discriminate between camera motion 
and zoom variation, we considered a variety of video sequences (see Fig. 6). Each 
sequence alternates zoom variations with camera motions, including translations 
along the optical axis $T_z$. For each frame of the sequence, the labeling in terms 
of zoom variation, rotation motion, translation motion is known. This allows us 
to compare the results of our algorithm with the actual ones. 




1: The cottage sequence   2: The cup sequence   3: The office sequence   4: The Loria sequence 

Fig. 6. Snapshots of the scenes used for testing the zoom/motion partitioning 
algorithm. 

We first compute the score of the zoom hypothesis for each frame of the four 
sequences. Then we compute the mean along with the standard deviation of the 
score for the frames of the sequence corresponding to zoom variation, rotation 
and translation and (more difficult cases) Z-translation and panoramic motion. 
These results are shown in Table 2: the first column shows the kind of variation 
undergone by the camera. The second and third columns give the scene under 
consideration and the number of frames in the sequence corresponding to the 
camera variation. Columns 4 and 5 show the mean and the standard deviation 
of the residual computed from the corresponding key-points (see equation (4)). 
Finally, columns 6 and 7 show the mean and the standard deviation of the score 
of the zoom hypothesis. These results clearly show that the residual 
defined in equation (4) does not make it possible to discriminate between zoom variations 
and translation along the optical axis. On the contrary, the score we have defined 
gives high values when a zoom happens and much smaller values when a camera 
motion happens, even in the case of a $T_z$ translation. Finally, these experiments show 
that the value $Th_{score} = 0.5$ is appropriate to distinguish zoom variations from 
camera motions. 



variation in the          scene   nb       mean r   sigma_r   mean     sigma 
camera parameters                 frames                      score    score 

Zoom                      1       6        0.617    0.030     0.747    0.055 
                          2       4        0.460    0.266     0.860    0.055 
                          3       32       0.860    0.057     0.677    0.133 
                          4       29       0.515    0.014     0.561    0.064 
Rotation + translation    1       10       3.593    1.439     -0.591   0.171 
Translation along the     1       2        0.651    0.020     0.393    0.066 
optical axis              2       4        0.841    0.018     0.274    0.035 
                          3       16       1.380    0.190     0.047    0.277 
Panoramic motion          4       15       0.630    0.066     -0.209   0.315 


Table 2. Score of the zoom hypothesis for various camera parameters. 
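
The figures of Table 2 can be checked against the threshold with a short script. The rows below are copied from the table; the comparison only involves the mean values, so it is an illustration of the reasoning rather than a statistical test, and the variable names are arbitrary.

```python
# (variation, scene, nb frames, mean r, sigma r, mean score, sigma score)
TABLE_2 = [
    ("zoom", 1, 6, 0.617, 0.030, 0.747, 0.055),
    ("zoom", 2, 4, 0.460, 0.266, 0.860, 0.055),
    ("zoom", 3, 32, 0.860, 0.057, 0.677, 0.133),
    ("zoom", 4, 29, 0.515, 0.014, 0.561, 0.064),
    ("rotation + translation", 1, 10, 3.593, 1.439, -0.591, 0.171),
    ("z-translation", 1, 2, 0.651, 0.020, 0.393, 0.066),
    ("z-translation", 2, 4, 0.841, 0.018, 0.274, 0.035),
    ("z-translation", 3, 16, 1.380, 0.190, 0.047, 0.277),
    ("panoramic motion", 4, 15, 0.630, 0.066, -0.209, 0.315),
]
THRESHOLD = 0.5

zoom_rows = [row for row in TABLE_2 if row[0] == "zoom"]
motion_rows = [row for row in TABLE_2 if row[0] != "zoom"]

# Mean scores: every zoom row is above the threshold, every motion row below.
zoom_above = all(row[5] > THRESHOLD for row in zoom_rows)
motion_below = all(row[5] < THRESHOLD for row in motion_rows)

# Mean residuals: the ranges for zoom and Z-translation overlap, so no
# threshold on r can separate the two cases.
zoom_r = [row[3] for row in zoom_rows]
tz_r = [row[3] for row in TABLE_2 if row[0] == "z-translation"]
residual_ranges_overlap = min(tz_r) <= max(zoom_r) and min(zoom_r) <= max(tz_r)
```

The lowest mean score among the zoom rows is 0.561 and the highest among the motion rows is 0.393, which leaves the value 0.5 between the two groups.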



5.2 Results in Zoom/Motion Partitioning 

We now give detailed results of our algorithm on the cottage sequence and the 
Loria sequence. Note that the camera parameters are known for the cottage 
sequence because the house stands on a calibration target. The Loria sequence 
is a 700-frame sequence which has been shot outside our laboratory. The actual 
camera parameters are not available for this sequence, but we have manually 
partitioned the sequence (see Table 3b) to enable comparison with the algorithm. 

For each of the two sequences (Fig. 7), we show the scores computed along 
the sequence, the results of our partitioning algorithm, and the computed zoom 
factor $C_0$. Also shown in Fig. 7b and 7e is the actual partition of the 
sequence for comparison. For the cottage sequence, the algorithm performance is 
quite good and the computed parameters are very close to the actual parameters. 

(a) image        motion/zoom 

    0 → 20       rotation 40° 
    20 → 35      zoom in 
    35 → 40      translation 10cm 
    40 → 55      zoom out 
    55 → 65      rotation -20° 

(b) image frames   camera parameters 

    0 → 120        panoramic motion 
    121 → 344      zoom in 
    345 → 408      no motion, nor zoom 
    409 → 600      zoom out 
    601 → end      panoramic motion 

Table 3. Camera parameters during (a) the cottage sequence and (b) the Loria 
sequence. 



For the Loria sequence, the reader can notice that some scores are higher than 
the threshold during the panoramic motion between frames 0 and 100 (Fig. 7d). 
However, in Fig. 7a and 7d, the test on the invariant point is shown with the 
dash-dot lines: the value 1 indicates that the invariant point is inside the image, 
while the value 0 indicates that the invariant point is outside the image. Using 
this constraint, the results of the partition process are very good (Fig. 7b and 
7e). 




Fig. 7. Results for the cottage sequence (first row) and the Loria sequence (second 
row). 



5.3 Registration Results 

In this section, registration results are shown for the cottage sequence and the 
Loria sequence. As the actual parameters are known for the cottage sequence, Fig. 
8 shows the trajectory and the focal length computed with our algorithm (dashed 
lines) along with the actual parameters (solid lines). The reader can notice that 
the parameters obtained are in close agreement with the actual values. To illustrate 
the accuracy of the camera parameters, we have augmented the scene with a 
palm tree and a beach umbrella (Fig. 9). Note that the shadows between the 
scene and the computer generated objects greatly improve the realism of the 
composite images. They have been computed from a rough 3D reconstruction 
of the scene given by the corresponding key-points. The reprojection of the 3D 
model features with the computed camera parameters is also shown. The overall 
impression is very good. 




Fig. 8. Comparison of the actual trajectory (a) and focal length $\alpha_u$ (b) (solid lines) 
with the computed ones (dashed lines). 




Fig. 9. Registration results on the cottage sequence: reprojection of the model (first 
row) and snapshots of the augmented scene (second row). 






We do not have the actual camera parameters for the Loria sequence. Hence 
looking at the reprojection of the model features is a good way to assess the 
registration accuracy. Fig. 10 shows the reprojection of the model every 
hundred frames. The reader can notice that the reprojection error is small even at 
the end of the sequence, which shows the efficiency of our algorithm. Finally, we 
augment the sequence with the well known sculpture La femme à la chevelure 
défaite by Miró. The interested reader can look at the video sequences 
of our results at URL http://www.loria.fr/~gsimon/eccv2000.html. 




Fig. 10. Registration results on the Loria sequence: the reprojection of the model 
every hundred frames (first row) and snapshots of the augmented scene (second row). 



6 Conclusion 

In this paper we have presented an efficient registration algorithm for a zoom 
lens camera. We restricted our study to the case of image sequences which alter- 
nate zoom variation alone and camera motion alone. This is a quite reasonable 
assumption which is always fulfilled by professional movie-makers. The perfor- 
mance of our algorithm is quite good and our algorithm is capable of discrimi- 
nating between zoom variations and Tz translations. However, our experiments 
show that some improvements and extensions can be made to our approach. 

First, experiments on the Loria sequence show that the camera trajectory is 
somewhat jagged. Smoothing the trajectory afterwards is not appropriate 
because the correspondences between the image and the 3D model would no longer 
be maintained. We are currently investigating methods to incorporate regularity constraints 
on the trajectory inside the registration process. 

Second, as was observed in our experiments, moving objects in the scene 
may perturb the partitioning process. Indeed, the correlation score is always low 
for moving objects and this may lead to false rejection of the zoom hypothesis. 
Detecting moving objects in the scene prior to the registration process could 
help to solve this problem. 

Abstract. We present a scheme for simultaneous calibration of a continuously
moving and continuously zooming camera: placing an easily distinguishable
pattern in the scene, we calibrate the camera from an unoccluded portion of the
pattern image in each frame. We describe an optimal method which provides an
evaluation of the reliability of the solution. We then propose a technique for
avoiding the inherent degeneracy and statistical fluctuations by model
selection using the geometric AIC and the geometric MDL.



1 Introduction 

Visually presenting 3-D shapes of real objects is one of the main goals of many 
Internet applications such as network cataloging and virtual museums. Today, 
generating virtual images by embedding graphics objects in real scenes or real 
objects in graphics scenes, known as mixed reality, is one of the central themes 
of image and media applications. In order to reconstruct the 3-D shapes of real 
objects or scenes for such applications, we need to know the 3-D position of the 
camera that we use and its internal parameters. Thus, camera calibration is a 
first step in all vision and media applications. 

The standard method for it is pre-calibration: the camera internal parameters
are determined from images of objects or patterns of known 3-D geometry in a
controlled environment. Recently, techniques for computing both the camera
parameters and the 3-D positions of the camera from an image sequence of the
scene about which we have no prior knowledge have been studied intensively.
Such a technique, known as self-calibration, may be useful in unknown
environments such as outdoors. For stable reconstruction, however, it requires
a long sequence of images taken from unconstrained camera positions and feature
matching among frames. As a result, the amount of computation is too large for
real-time applications, and it cannot be applied if the camera motion is
constrained or the scene changes as the camera moves unless we are given a
priori information about the constraint or the scene change (self-calibration
based on a priori information about the camera motion has also been studied).

Fig. 1. Simultaneous calibration of a moving camera: we observe an unoccluded part
of the image of a planar pattern placed in the scene.

In this paper, we focus on virtual studio applications: we take images of
moving objects such as persons and superimpose them on a graphics-generated
background in real time by computing the 3-D positions and zooming of a moving
camera. Since the scene as well as the position and zooming of the camera
changes from frame to frame, we cannot pre-calibrate or self-calibrate the
camera.

This difficulty can be overcome by placing an easily distinguishable planar
pattern with a known geometry in the scene (Fig. 1): we detect an unoccluded
portion of the pattern image in each frame, compute the 3-D position and
zooming of the camera from it, and remove the pattern image by segmentation. We
call this strategy simultaneous calibration. It has many elements that do not
appear in pre-calibration:

1. While manual interventions can be employed in pre-calibration, simultaneous 
calibration must be completely automated. In particular, we must automa- 
tically identify the 3-D positions of the marker points that are unoccluded 
in each frame. 

2. Since the number of unoccluded marker points is different in each frame, the 
accuracy of calibration is different from frame to frame. Hence, not only do 
we need an accurate computational procedure but also a scheme for evalua- 
ting the reliability of the computed solution. 

3. Since we have no control over the camera position relative to the pattern, 
degenerate configurations can occur: when the camera optical axis is perpen- 
dicular to the pattern, the 3-D position and focal length of the camera are 
indeterminate because zooming out and moving the camera forward cause 
the same visual effect. 

4. As the object moves in the scene, some unoccluded marker points become
occluded while others become unoccluded. As a result, the computed camera
position may not be the same even if the camera is stationary in the scene.
This type of statistical fluctuation becomes conspicuous when the camera
motion is small.

In this paper, we introduce a statistical model of image noise and describe a
procedure for computing an optimal solution that attains the Cramér-Rao lower
bound (CRLB) in the presence of noise. As a result, we can evaluate the
reliability of the solution by computing an estimate of the CRLB.

We then show that degeneracy and statistical fluctuations can be avoided by
model selection. At each frame, we predict the 3-D position and zooming of the
camera in multiple ways from the past history. We then evaluate the goodness
of each prediction, or model, and adopt the best one. In this paper, we use the
geometric AIC introduced by Kanatani and the geometric MDL, to be defined
shortly, as the model selection criteria.

The geometric MDL we use is different from the traditional MDL used in
statistics and some vision applications. We compare the performances of the
geometric AIC and the geometric MDL by doing numerical simulations and real
image experiments.

2 Basic Principle 

We fix an XYZ world coordinate system in the scene and place a planar pattern
parallel to the XY plane at a known distance d. We imagine a hypothetical
camera with a known focal length $f_0$ placed at the world origin O in such a
way that the optical axis coincides with the Z-axis and the image x- and y-axes
are parallel to the X- and Y-axes. The 3-D position of the actual camera is
regarded as obtained by rotating the hypothetical camera by R (rotation
matrix), translating it by t, and changing the focal length into f; we call
{t, R} the motion parameters. We regard the focal length f as a single unknown
internal parameter, assuming that other parameters, such as the image skew and
the aspect ratio, have already been pre-calibrated so that the imaging geometry
can be modeled as a perspective projection.

Suppose N points on the planar pattern with known coordinates
$(X_\alpha, Y_\alpha, d)$ are observed at $(x_\alpha, y_\alpha)$ in the image.
If we define the 3-D vectors

$$\bar{x}_\alpha = (X_\alpha/d,\; Y_\alpha/d,\; 1)^\top, \qquad
  x_\alpha = (x_\alpha/f_0,\; y_\alpha/f_0,\; 1)^\top, \tag{1}$$

we have the following relationship:

$$x_\alpha = Z[H\bar{x}_\alpha]. \tag{2}$$

Here, Z[·] denotes normalization to make the third component 1, and H is the
matrix in the following form [12]:

$$H = \mathrm{diag}(1, 1, f_0/f)\, R^\top \Bigl( I - \frac{t k^\top}{d} \Bigr). \tag{3}$$

Throughout this paper, i, j, and k denote $(1,0,0)^\top$, $(0,1,0)^\top$, and
$(0,0,1)^\top$, respectively, and diag(···) denotes the diagonal matrix with
diagonal elements ···.
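
The projection model of this section can be sketched in a few lines of Python.
This is only an illustration under stated assumptions: it uses NumPy, it
assumes that H has the form diag(1, 1, f0/f) R^T (I - t k^T / d) and that the
pattern and image vectors are scaled by d and f0, respectively, and the
function names and the default value f0 = 600 are hypothetical choices, not
taken from the paper.

```python
import numpy as np

def homography(f, R, t, d, f0=600.0):
    """Assumed form H = diag(1, 1, f0/f) R^T (I - t k^T / d) for the plane Z = d."""
    k = np.array([0.0, 0.0, 1.0])
    return np.diag([1.0, 1.0, f0 / f]) @ R.T @ (np.eye(3) - np.outer(t, k) / d)

def normalize(v):
    """Z[.]: scale a 3-vector so that its third component becomes 1."""
    return v / v[2]

def project(f, R, t, d, X, Y, f0=600.0):
    """Image coordinates (in pixels) of the pattern point (X, Y, d)."""
    x_bar = np.array([X / d, Y / d, 1.0])          # pattern vector
    x = normalize(homography(f, R, t, d, f0) @ x_bar)
    return f0 * x[0], f0 * x[1]                    # undo the 1/f0 scaling
```

For the hypothetical camera itself (R = I, t = 0, f = f0) the sketch reduces to
the usual pinhole projection (f0 X / d, f0 Y / d).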

3 Optimal Computation

Eq. (2) defines an image transformation called a homography. Since the unknown
parameters are {t, R} and f, the homography has seven degrees of freedom. If
the homography is unconstrained with eight degrees of freedom, we can apply
our statistically optimal renormalization-based algorithm; its C++ code is
available via the Web¹. Here, however, the homography is constrained. So, we
take the bundle-adjustment approach based on Newton iterations.

Let $V[x_\alpha]$ be the covariance matrix of the data vector $x_\alpha$. We
assume that it is known only up to scale and write

$$V[x_\alpha] = \epsilon^2 V_0[x_\alpha]. \tag{4}$$

We call the unknown magnitude $\epsilon$ the noise level and the matrix
$V_0[x_\alpha]$ the normalized covariance matrix. Since the third component of
$x_\alpha$ is 1, $V_0[x_\alpha]$ is a singular matrix of rank 2 with zeros in
the third row and the third column. If the noise has no particular dependence
on position and orientation, it has the form diag(1, 1, 0), which we use as the
default value.

If the noise is Gaussian, an optimal estimate of H is obtained by maximum
likelihood estimation: we minimize the average squared Mahalanobis distance

$$J = \frac{1}{N} \sum_{\alpha=1}^{N}
  \bigl( x_\alpha - Z[H\bar{x}_\alpha],\;
         V_0[x_\alpha]^{-} ( x_\alpha - Z[H\bar{x}_\alpha] ) \bigr), \tag{5}$$

where, here and throughout this paper, the operation $(\cdot)^{-}$ denotes the
(Moore-Penrose) generalized inverse and (a, b) denotes the inner product of
vectors a and b. We define the following non-dimensional variables:

$$\phi = \frac{f}{f_0}, \qquad \tau = \frac{t}{d}. \tag{6}$$

The first order perturbation of R is written as
$R \to R + \Delta\boldsymbol{\Omega} \times R$, where
$\Delta\boldsymbol{\Omega}$ is a 3-D vector and
$\Delta\boldsymbol{\Omega} \times R$ is a matrix whose columns are the vector
products of $\Delta\boldsymbol{\Omega}$ and each column of R. We define the
gradient $\nabla J$ and the Hessian $\nabla^2 J$ with respect to
$\{\phi, \tau, R\}$ in such a way that the Taylor expansion of J has the form

$$J(\phi + \Delta\phi,\; \tau + \Delta\tau,\; R + \Delta\boldsymbol{\Omega} \times R)
  = J
  + \Bigl( \nabla J,
      \begin{pmatrix} \Delta\phi \\ \Delta\tau \\ \Delta\boldsymbol{\Omega} \end{pmatrix} \Bigr)
  + \frac{1}{2} \Bigl(
      \begin{pmatrix} \Delta\phi \\ \Delta\tau \\ \Delta\boldsymbol{\Omega} \end{pmatrix},
      \nabla^2 J
      \begin{pmatrix} \Delta\phi \\ \Delta\tau \\ \Delta\boldsymbol{\Omega} \end{pmatrix} \Bigr)
  + \cdots. \tag{7}$$

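The solution that minimizes J is obtained by the following Newton iterations:

1. Give an initial guess of $\phi$, $\tau$, and R.

2. Compute the gradient $\nabla J$ and the Hessian $\nabla^2 J$ (their actual
expressions are omitted).

3. Compute $\Delta\phi$, $\Delta\tau$, and $\Delta\boldsymbol{\Omega}$ by
solving the linear equation

$$\nabla^2 J
  \begin{pmatrix} \Delta\phi \\ \Delta\tau \\ \Delta\boldsymbol{\Omega} \end{pmatrix}
  = -\nabla J. \tag{8}$$

4. If $|\Delta\phi| < \epsilon_\phi$, $\|\Delta\tau\| < \epsilon_\tau$, and
$\|\Delta\boldsymbol{\Omega}\| < \epsilon_R$, return $\phi$, $\tau$, and R and
stop. Otherwise, update $\phi$, $\tau$, and R in the form

$$\phi \leftarrow \phi + \Delta\phi, \qquad
  \tau \leftarrow \tau + \Delta\tau, \qquad
  R \leftarrow \mathcal{R}(\Delta\boldsymbol{\Omega})\, R, \tag{9}$$

and go back to Step 2.

The symbol $\mathcal{R}(\Delta\boldsymbol{\Omega})$ denotes the rotation of
angle $\|\Delta\boldsymbol{\Omega}\|$ around $\Delta\boldsymbol{\Omega}$;
$\epsilon_\phi$, $\epsilon_\tau$, and $\epsilon_R$ are thresholds for
convergence.

¹ http://www.ail.cs.gunma-u.ac.jp/~kanatani/e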

The initial guess of $\phi$, $\tau$, and R can be obtained by computing the
homography H between $\{\bar{x}_\alpha\}$ and $\{x_\alpha\}$, say, by least
squares or by the renormalization-based method, without considering the
constraint, and approximately decomposing it into $\phi$, $\tau$, and R in the
form of eq. (3) (an analytical procedure for this exists). However, this
procedure is necessary only for the initial frame. For the subsequent frames,
we can start from the solution in the preceding frame or an appropriate
prediction from it, as we will describe shortly.
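
The two building blocks of the iterations, the objective J and the update of
the rotation, can be sketched as follows. This is a minimal illustration, not
the authors' code: it assumes NumPy, the default normalized covariance
diag(1, 1, 0) (so that the Mahalanobis distance reduces to the squared
distance in the first two components), the non-dimensional variables
phi = f/f0 and tau = t/d, and hypothetical function names. The gradient and
Hessian, whose expressions the paper omits, are not reproduced here.

```python
import numpy as np

def rotation(omega):
    """Rotation of angle ||omega|| around the axis omega (Rodrigues formula)."""
    omega = np.asarray(omega, dtype=float)
    angle = np.linalg.norm(omega)
    if angle < 1e-12:
        return np.eye(3)
    l = omega / angle
    K = np.array([[0.0, -l[2], l[1]],
                  [l[2], 0.0, -l[0]],
                  [-l[1], l[0], 0.0]])
    return np.eye(3) + np.sin(angle) * K + (1.0 - np.cos(angle)) * (K @ K)

def residual(phi, tau, R, x_bar, x):
    """Average squared distance J for V0 = diag(1, 1, 0).

    x_bar, x: (N, 3) arrays of pattern and image vectors (third component 1).
    """
    k = np.array([0.0, 0.0, 1.0])
    H = np.diag([1.0, 1.0, 1.0 / phi]) @ R.T @ (np.eye(3) - np.outer(tau, k))
    y = x_bar @ H.T            # rows are H x_bar_alpha
    y = y / y[:, 2:3]          # Z[.]: make the third component 1
    return float(np.mean(np.sum((x[:, :2] - y[:, :2]) ** 2, axis=1)))

def update(phi, tau, R, d_phi, d_tau, d_omega):
    """One parameter update; the rotation increment is applied from the left."""
    return phi + d_phi, tau + d_tau, rotation(d_omega) @ R
```

Applying the rotation increment multiplicatively keeps R an exact rotation
matrix at every iteration, which an additive update would not.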



4 Reliability Evaluation 



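The squared noise level $\epsilon^2$ can be estimated from the residual
$\hat{J}$ (the minimum value of J) in the following form [12]:

$$\hat{\epsilon}^2 = \frac{\hat{J}}{2 - 7/N}. \tag{10}$$

Let $\nabla^2 \hat{J}$ be the resulting Hessian. The covariance matrix of
$\{\hat{\phi}, \hat{\tau}, \hat{R}\}$ is estimated in the following form:

$$V[\hat{\phi}, \hat{\tau}, \hat{R}]
  = \frac{2\hat{\epsilon}^2}{N} \bigl( \nabla^2 \hat{J} \bigr)^{-1}. \tag{11}$$

This gives an estimate of the Cramér-Rao lower bound (CRLB) on
$V[\hat{\phi}, \hat{\tau}, \hat{R}]$.

The (1,1) element of $V[\hat{\phi}, \hat{\tau}, \hat{R}]$ gives the variance
$V[\hat{\phi}]$ of $\hat{\phi}$. It follows that if the error distribution is
approximated to be Gaussian, the 99.7% confidence interval of f has the form

$$\hat{\phi} - 3\sqrt{V[\hat{\phi}]} < \frac{f}{f_0}
  < \hat{\phi} + 3\sqrt{V[\hat{\phi}]}. \tag{12}$$

The submatrix of $V[\hat{\phi}, \hat{\tau}, \hat{R}]$ defined by its second to
fourth rows and columns gives the covariance matrix $V[\hat{\tau}]$ of
$\hat{\tau}$. Let $\Delta\Omega$ and $l$ be, respectively, the angle and axis
of the rotation $\hat{R} R^\top$ relative to the true rotation R. Let
$\Delta\boldsymbol{\Omega} = \Delta\Omega\, l$. The submatrix of
$V[\hat{\phi}, \hat{\tau}, \hat{R}]$ defined by its fifth to seventh rows and
columns gives the covariance matrix $V[\hat{R}]$ of
$\Delta\boldsymbol{\Omega}$.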
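
As a hedged sketch of how these quantities could be evaluated numerically, the
following Python function computes the noise estimate, the covariance matrix,
and the 3-sigma interval of the focal length. It assumes NumPy, a 7 x 7
Hessian ordered as (phi, tau, Omega), the forms eps^2 = J / (2 - 7/N) and
V = (2 eps^2 / N) Hessian^{-1}, and phi = f/f0; the function and argument
names are hypothetical.

```python
import numpy as np

def reliability(J_min, hessian, N, phi_hat, f0):
    """Noise level, covariance estimate and 99.7% interval of the focal length."""
    eps2 = J_min / (2.0 - 7.0 / N)                    # squared noise level
    cov = (2.0 * eps2 / N) * np.linalg.inv(hessian)   # estimated CRLB
    half = 3.0 * np.sqrt(cov[0, 0])                   # 3 sigma of phi
    f_interval = (f0 * (phi_hat - half), f0 * (phi_hat + half))
    cov_tau = cov[1:4, 1:4]                           # translation block
    cov_rot = cov[4:7, 4:7]                           # rotation block
    return eps2, cov, f_interval, cov_tau, cov_rot
```

The explicit inverse is acceptable only when the Hessian is well conditioned;
near the degenerate configuration discussed in Section 6 it is not.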

                         empirical    CRLB
focal length (pixels)      33.4       34.0
translation (cm)           32.9       32.6
rotation (deg)             0.413      0.414

Fig. 2. Simulated image of a grid pattern (left); the standard deviations of the
optimally computed solutions and estimates of their Cramér-Rao lower bounds (right).






Fig. 3. (a) Histogram of the computed focal length, (b) Error distribution of the com- 
puted translation, (c) Error distribution of the computed rotation. 



5 Examples of Reliability Evaluation 

5.1 Numerical Simulation 

Fig. 2 shows a simulated image of a grid pattern viewed from an angle. We added
Gaussian random noise of mean 0 and standard deviation 1 (pixel) to the x and
y coordinates of the vertices independently and computed the focal length and
the motion parameters 1,000 times, using different noise each time. The standard
deviations of the computed solutions and estimates of their CRLBs are listed in
Fig. 2.

Fig. 3(a) is the histogram of the computed focal length f. The vertical lines
indicate the estimated CRLB. Fig. 3(b) is a 3-D plot of the distribution of the
error vector $\Delta t = \hat{t} - \bar{t}$ of translation. The ellipse
indicates the estimated CRLB in each orientation. Fig. 3(c) is a 3-D plot of
the error vector $\Delta\boldsymbol{\Omega}$ of rotation depicted similarly.

From these results, we can confirm that the estimated CRLB can be used as 
a reliability measure of the solution. 

5.2 Tennis Court Scene 

Fig. 4(a) is a real image of a tennis court. Since the size of the court is
stipulated by an international rule, we can compute the 3-D camera position and
the focal length by using this knowledge. The focal length is estimated to be
955 pixels. The camera is estimated to be at 627 cm above the ground. The
standard deviations of the focal length, the translation, and the rotation are
evaluated to be 6.99 pixels, 16.14 cm, and 0.151 deg, respectively.

Fig. 4. (a) A real image of a tennis court. (b) The computed camera position
viewed from above. (c) A virtual scene generated from (a).

Fig. 4(b) shows the top view of the tennis court generated from Fig. 4(a).
The estimated camera position is plotted there and encircled by an ellipse, which
indicates three times the standard deviation of the estimated position in each
orientation (actually it is an ellipsoid viewed from above).

The images of the poles and the persons in Fig. 4(b) can be regarded as their
“shadows” on the ground cast by hypothetical light emitted from the camera, so
we can compute their heights [5, 10]. The right pole is estimated to be 113 cm
in height. The person near the camera is estimated to be 171 cm tall. This
technique can be applied to 3-D analysis of sports broadcasting [25, 28]. Since
we know the 3-D structure of the scene, we can generate a virtual view of a new
object placed in the scene. Fig. 4(c) is a virtual view of a logo placed on the
tennis court.



5.3 Virtual Studio 

Fig. 5(a) is a real image of a toy, behind which is placed a grid pattern colored
light and dark blue. The grid pattern is placed perpendicularly to the floor. The
camera optical axis is almost parallel to the floor. Unoccluded grid points in
the image were matched to their true positions in the pattern by observing the
cross ratio of adjacent points. This pattern is so designed that the cross ratio is
different everywhere in such a way that matching can be done in a statistically
optimal way in the presence of image noise.

After separating the toy image from the background by using a chromakey
technique, we computed the 3-D position and focal length of the camera by
observing an unoccluded portion of the grid pattern (the image processing
details are omitted here). The focal length is estimated to be 576 pixels. The
standard deviations of the focal length, the translation, and the rotation are
evaluated to be 38.3 pixels, 5.73 cm, and 0.812 deg, respectively.

Fig. 5(b) is the top view of the estimated camera position and its uncertainty
ellipsoid (three times the standard deviation in each orientation). Fig. 5(c) is a
composition of the toy image and a graphics scene generated by VRML.

Fig. 5. (a) Original image. (b) Estimated camera position and its reliability.
(c) A virtual scene generated from (a).



6 Trajectory Stabilization 

If the camera optical axis is perpendicular to the planar pattern, the Hessian
$\nabla^2 J$ in eq. (8) is a singular matrix, so the solution is indeterminate.
This does not occur in practice due to image noise, but the resulting solution
is numerically unstable. Also, as pointed out in the Introduction, the computed
camera position fluctuates when the camera motion is small. We now present a
technique for avoiding degeneracy and statistical fluctuations by model
selection.



6.1 Model Selection Criteria 

The homography H given by eq. (3) is parameterized by {t, R} and f, having
seven degrees of freedom. If the motion and zooming of the camera are
constrained in some way (e.g., the camera is translated without rotation or
zooming), the homography H has a smaller degree of freedom, and a smaller
number of parameters need to be estimated. In general, parameter estimation
becomes more stable as the number of parameters decreases.

It follows that we can stably estimate the parameters or avoid degeneracy if
we know the constraint on the camera motion or zooming. In practice,
however, we do not know how the camera is moving or zooming. Our strategy
here is to assume probable constraints (translation only, etc.), which we call
models, compare them with each other, and adopt the best one. A naive idea for
this is to compute the residual $\hat{J}$ for each model and choose the one for
which it is minimum. However, this does not work: the general model always has
the smallest residual, since the residual decreases as the degree of freedom
increases.

The best known criterion for balancing the residual and the degree of
freedom of the model is Akaike’s AIC, designed for statistical estimation
and used in some vision applications. Kanatani’s geometric AIC is a
variant of Akaike’s AIC specifically designed for geometric estimation and has
been applied to a variety of vision applications. In the present case, the
geometric AIC for minimizing eq. (5) is written as

$$\text{G-AIC} = \hat{J} + \frac{2k}{N}\epsilon^2, \tag{13}$$




Calibration of a Moving Camera Using a Planar Pattern 603 



where k is the degree of freedom of the homography H. The square noise
level $\epsilon^2$ is estimated from the general model in the form of eq. (10).

Another well known criterion is Rissanen’s MDL (minimum description
length) based on the information theoretic code length of the model [26, 27]. It
is derived by analyzing the function space of “stochastic models” identified with
parameterized probability densities in the asymptotic limit of a large number of
observations. Here, the models we want to compare are geometric constraints,
not parameterized probability densities. Also, we are given only one set of data
(i.e., one observation) for each frame. Hence, Rissanen’s MDL cannot be used in
its original form.

The starting point of Rissanen’s MDL is the observation that encoding a
real number requires an infinite code length. Rissanen’s idea is to quantize the
parameters to obtain a finite code length, taking into account the fact that real
numbers cannot be estimated completely [23]. The quantization width is
determined by attainable estimation accuracy, which in turn is determined by the
data length n. Since the code length diverges as $n \to \infty$, asymptotic
approximation comes into play. In this sense, the “minimum description length”
actually means the “minimum growth rate” of the description length.

Suppose we hypothetically repeat independent observations, although the
actual observation is done only once. The accuracy of estimation increases with
the number of hypothetical observations, so we can define the MDL by asymptotic
analysis. But increasing the number n of observations effectively reduces the
noise level $\epsilon$ to $O(1/\sqrt{n})$. It follows that we can define the
MDL as the “growth rate” of the description length as $\epsilon \to 0$. The
final form is as follows (we omit the details of the code length analysis):

$$\text{G-MDL} = \hat{J} - \frac{k}{N}\epsilon^2 \log \epsilon^2. \tag{14}$$

We call this criterion the geometric MDL². This form can also be obtained from
Rissanen’s MDL by replacing n by $1/\epsilon^2$ and is different from any MDLs
used in statistics and vision applications in that ours does not contain
the logarithm of the number of the data.

² Since the additive terms can be ignored when $\epsilon \ll 1$, changing the
unit of length does not affect the relative comparison of models asymptotically.

6.2 Degeneracy Detection

If degeneracy occurs, the confidence interval (12) expands infinitely wide if no
noise exists. In the presence of noise, it has a finite width. We decide that
degeneracy has occurred if the confidence interval contains negative values of f.
This means that we adopt the following criterion:

$$V[\hat{\phi}] > \frac{\hat{\phi}^2}{9}. \tag{15}$$

The variance $V[\hat{\phi}]$ equals the (1,1) element of the covariance matrix
$V[\hat{\phi}, \hat{\tau}, \hat{R}]$ given by eq. (11), so it is equal to
$2\hat{\epsilon}^2 (\nabla^2 \hat{J})^\dagger_{11} / N \det(\nabla^2 \hat{J})$,
where $(\nabla^2 \hat{J})^\dagger_{11}$ is the (1,1)-cofactor of the Hessian
$\nabla^2 \hat{J}$ (the determinant of the submatrix obtained by removing the
first row and the first column of $\nabla^2 \hat{J}$). Hence, eq. (15) can be
rewritten in the form

$$18\hat{\epsilon}^2 (\nabla^2 \hat{J})^\dagger_{11}
  - N \hat{\phi}^2 \det(\nabla^2 \hat{J}) > 0. \tag{16}$$

Since matrix inversion is no longer involved, this expression can always be stably
evaluated.
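
The inversion-free test can be sketched as below. This is an illustration
under assumptions: NumPy is used, the Hessian is a 7 x 7 array whose first row
and column correspond to phi, the criterion is taken in the form
18 eps^2 C11 - N phi^2 det > 0 with C11 the (1,1)-cofactor, and the function
name is hypothetical.

```python
import numpy as np

def is_degenerate(hessian, eps2, phi_hat, N):
    """True if the 3-sigma interval of the focal length reaches negative values."""
    # (1,1)-cofactor: determinant of the Hessian without its first row and column
    minor = np.delete(np.delete(hessian, 0, axis=0), 0, axis=1)
    cofactor = np.linalg.det(minor)
    value = 18.0 * eps2 * cofactor - N * phi_hat ** 2 * np.linalg.det(hessian)
    return bool(value > 0.0)
```

Because only determinants are evaluated, the test still returns an answer when
the Hessian is exactly singular, where an explicit inverse would fail.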



6.3 Models of Zooming and Motion of the Camera 

We predict the focal length f and the motion parameters {t, R} in the next
frame from the values $f_i$ and $\{t_i, R_i\}$ of the current frame and the
values $f_{i-1}$ and $\{t_{i-1}, R_{i-1}\}$ of the preceding frame. Here, we
consider the following six models:

Stationary model: We assume that the camera is stationary: $f = f_i$,
$t = t_i$, and $R = R_i$. Let $\hat{J}_*$ be the corresponding residual. This
model has zero degrees of freedom.

t-fixed model: We assume that the camera only rotates. We let $f = f_i$ and
$t = t_i$ and optimally compute the rotation R by Newton iterations starting
from $R_i$. Let $\hat{J}_{s'}$ be the corresponding residual. This model has
three degrees of freedom.

t-predicted model: Assuming that the zooming does not change, we linearly
extrapolate the camera position and let $t = 2t_i - t_{i-1}$. Then, we
optimally compute the rotation R by Newton iterations starting from
$R_i R_{i-1}^\top R_i$. Let $\hat{J}_{p'}$ be the corresponding residual. This
model has three degrees of freedom.

f-fixed model: Assuming that the zooming does not change, we optimally
compute the motion parameters {t, R} by Newton iterations starting from
$\{t_i, R_i\}$. Let $\hat{J}_s$ be the corresponding residual. This model has
six degrees of freedom. The square noise level is estimated by

$$\hat{\epsilon}_s^2 = \frac{\hat{J}_s}{2 - 6/N}. \tag{17}$$

f-predicted model: We linearly extrapolate the focal length and let
$f = 2f_i - f_{i-1}$. Then, we optimally compute the motion parameters {t, R}
by Newton iterations starting from $\{2t_i - t_{i-1},\, R_i R_{i-1}^\top R_i\}$.
Let $\hat{J}_p$ be the corresponding residual. This model has six degrees of
freedom. The square noise level is estimated by

$$\hat{\epsilon}_p^2 = \frac{\hat{J}_p}{2 - 6/N}. \tag{18}$$

General model: We optimally compute the focal length f and the motion
parameters {t, R} by Newton iterations starting from the solution obtained
from the f-predicted model. Let $\hat{J}_g$ be the corresponding residual. This
model has seven degrees of freedom.
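
The predictions used as fixed values or starting points by these models amount
to a linear extrapolation of f and t and a repetition of the last relative
rotation. A minimal sketch, assuming NumPy arrays for t and R and using a
hypothetical function name and dictionary keys:

```python
import numpy as np

def predict(f_i, t_i, R_i, f_prev, t_prev, R_prev):
    """Fixed and extrapolated values for the next frame."""
    return {
        "f_fixed": f_i,
        "f_predicted": 2.0 * f_i - f_prev,      # linear extrapolation of f
        "t_fixed": t_i,
        "t_predicted": 2.0 * t_i - t_prev,      # linear extrapolation of t
        "R_fixed": R_i,
        "R_predicted": R_i @ R_prev.T @ R_i,    # repeat the last relative rotation
    }
```

The product R_i R_{i-1}^T R_i applies the rotation that took the camera from
the preceding frame to the current one once more, which is the rotational
counterpart of 2 t_i - t_{i-1}.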
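
Degeneracy is detected from the f-predicted model. Namely, we estimate the
square noise level $\epsilon^2$ by eq. (18) and evaluate the criterion (16). If
degeneracy is not detected, we compare the stationary model, the f-fixed model,
the f-predicted model, and the general model. Estimating the square noise level
$\epsilon^2$ by eq. (10), we evaluate the geometric AICs and the geometric MDLs
of these models in the following form:

$$\begin{aligned}
\text{G-AIC}_* &= \hat{J}_*, &
\text{G-AIC}_s &= \hat{J}_s + \frac{12}{N}\hat{\epsilon}^2, \\
\text{G-AIC}_p &= \hat{J}_p + \frac{12}{N}\hat{\epsilon}^2, &
\text{G-AIC}_g &= \hat{J}_g + \frac{14}{N}\hat{\epsilon}^2, \\
\text{G-MDL}_* &= \hat{J}_*, &
\text{G-MDL}_s &= \hat{J}_s - \frac{6}{N}\hat{\epsilon}^2 \log \hat{\epsilon}^2, \\
\text{G-MDL}_p &= \hat{J}_p - \frac{6}{N}\hat{\epsilon}^2 \log \hat{\epsilon}^2, &
\text{G-MDL}_g &= \hat{J}_g - \frac{7}{N}\hat{\epsilon}^2 \log \hat{\epsilon}^2.
\end{aligned} \tag{19}$$

The model that gives the smallest AIC or the smallest MDL is chosen.

If degeneracy is detected, we compare the stationary model, the t-fixed model,
the t-predicted model, and the f-fixed model. Estimating the square noise
level $\epsilon^2$ by eq. (17), we evaluate the geometric AICs and the geometric
MDLs of these models in the following form:

$$\begin{aligned}
\text{G-AIC}_* &= \hat{J}_*, &
\text{G-AIC}_{s'} &= \hat{J}_{s'} + \frac{6}{N}\hat{\epsilon}_s^2, \\
\text{G-AIC}_{p'} &= \hat{J}_{p'} + \frac{6}{N}\hat{\epsilon}_s^2, &
\text{G-AIC}_s &= \hat{J}_s + \frac{12}{N}\hat{\epsilon}_s^2, \\
\text{G-MDL}_* &= \hat{J}_*, &
\text{G-MDL}_{s'} &= \hat{J}_{s'} - \frac{3}{N}\hat{\epsilon}_s^2 \log \hat{\epsilon}_s^2, \\
\text{G-MDL}_{p'} &= \hat{J}_{p'} - \frac{3}{N}\hat{\epsilon}_s^2 \log \hat{\epsilon}_s^2, &
\text{G-MDL}_s &= \hat{J}_s - \frac{6}{N}\hat{\epsilon}_s^2 \log \hat{\epsilon}_s^2.
\end{aligned} \tag{20}$$

The model that gives the smallest AIC or the smallest MDL is chosen.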
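
The selection step itself is a simple comparison of scores. The sketch below
is an illustration only: it assumes the criteria have the forms
J + (2k/N) eps^2 and J - (k/N) eps^2 log eps^2, with J the average residual of
a model and k its degree of freedom, and all function names, model labels, and
numerical values are hypothetical.

```python
import math

def g_aic(J, k, eps2, N):
    """Geometric AIC in the assumed form J + (2k/N) eps^2."""
    return J + 2.0 * k * eps2 / N

def g_mdl(J, k, eps2, N):
    """Geometric MDL in the assumed form J - (k/N) eps^2 log eps^2."""
    return J - k * eps2 * math.log(eps2) / N

def select(residuals, dof, eps2, N, criterion=g_mdl):
    """Return the name of the model with the smallest criterion value."""
    scores = {m: criterion(residuals[m], dof[m], eps2, N) for m in residuals}
    return min(scores, key=scores.get)

# Hypothetical residuals for the non-degenerate case with N = 9 grid points.
dof = {"stationary": 0, "f_fixed": 6, "f_predicted": 6, "general": 7}
residuals = {"stationary": 3.0e-4, "f_fixed": 1.0e-4,
             "f_predicted": 1.2e-4, "general": 0.9e-4}
print(select(residuals, dof, 1e-4, 9, criterion=g_aic))
print(select(residuals, dof, 1e-4, 9, criterion=g_mdl))
```

With these made-up numbers the two criteria disagree: the geometric AIC picks
the f-fixed model while the geometric MDL picks the stationary model, because
its penalty per degree of freedom is larger at this noise level.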

7 Model Selection Examples 

7.1 Numerical Simulation 

We simulate a camera motion in a plane perpendicular to a 3 × 3 grid pattern. In
the course of its motion, the camera is rotated so that the center of the pattern
is always fixed at the center of the image frame. First, the camera moves along a
circular trajectory as shown in Fig. 6(a). It perpendicularly faces the pattern at
frame 13 and stops at frame 20. The camera stays there for five frames (frames
20 ~ 24) and then recedes backward for another five frames (frames 25 ~ 30).

Adding random Gaussian noise of mean 0 and standard deviation 1 (pixel) to
each coordinate of the grid points independently at each frame, we compute the
focal length and the trajectory of the camera (Figs. 6(b) and 6(c)). Degeneracy
is detected at frames 12 and 13. In order to emphasize the fact that the
frame-wise estimation fails, we let f be ∞ and the camera position be at the
center of the grid pattern in Figs. 6(b) and 6(c) when degeneracy is detected.

Fig. 6. (a) Simulated camera motion. (b) Estimated focal lengths. (c) Estimated
camera trajectory. (d) Magnification of the portion of (c) for frames 20 ~ 24.
In (b)~(d), the solid lines indicate model selection by the geometric AIC; the
thick dashed lines indicate model selection by the geometric MDL; the thin
dotted lines indicate frame-wise estimation.



As we can see, both the geometric AIC and the geometric MDL produce a
smoother trajectory than frame-wise estimation, and the computed trajectory
smoothly passes through the degenerate configuration. Fig. 6(d) is a
magnification of the portion for frames 20 ~ 24 in Fig. 6(c). We can observe that
statistical fluctuations exist if the camera position is estimated at each frame
independently and that the fluctuations are removed by model selection.

From these results, it is clearly seen that the geometric MDL has a stronger
smoothing effect than the geometric AIC. This is because the penalty
$-(\epsilon^2/N)\log\epsilon^2$ for each degree of freedom in the geometric MDL
is generally larger than the penalty $2\epsilon^2/N$ in the geometric AIC (see
eq. (13) and eq. (14)), so the geometric MDL tends to select a simpler model
than the geometric AIC.
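
The comparison of the two penalties can be made explicit. The following short
derivation is a sketch that assumes the criteria have the forms
$\hat{J} + (2k/N)\epsilon^2$ and $\hat{J} - (k/N)\epsilon^2\log\epsilon^2$ and
that the noise level is measured on image coordinates scaled by $f_0$; the
value $f_0 = 600$ pixels is an illustrative assumption.

```latex
-\frac{\epsilon^2}{N}\log\epsilon^2 > \frac{2\epsilon^2}{N}
\iff \log\epsilon^2 < -2
\iff \epsilon < e^{-1} \approx 0.37 .
% Example: noise of 1 pixel with f_0 = 600 gives \epsilon = 1/600, so that
% -\log\epsilon^2 = 2\log 600 \approx 12.8 > 2 .
```

Under these assumptions the MDL penalty exceeds the AIC penalty for any
realistic noise level, and the gap widens as the noise decreases.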



7.2 Virtual Studio 

Fig. 7 shows five sampled frames from a real image sequence obtained in the
setting described in Section 5.3. The camera moves from right to left with a
fixed focal length. The camera optical axis becomes almost perpendicular to the
grid pattern in the 15th frame. Degeneracy is detected there and thereafter.

Fig. 7. Sampled frames from a real image sequence.

Fig. 8. (a) Estimated focal lengths. (b) Estimated camera trajectory. In (a) and
(b), the solid lines indicate model selection by the geometric AIC; the thick
dashed lines indicate model selection by the geometric MDL; the thin dotted
lines indicate frame-wise estimation.

Fig. 8(a) shows the estimated focal lengths; Fig. 8(b) shows the estimated
camera trajectory viewed from above. The frame-wise estimation fails when
degeneracy occurs. In this case, the estimation by the geometric MDL is more
consistent with the actual camera motion than the geometric AIC. But this is
because we fixed the zooming and moved the camera smoothly. If we added
variations to the zooming and the camera motion, the geometric MDL would still
prefer a smooth motion. So, we cannot say which solution should be closer to
the true solution; it depends on what kind of solution is desirable for
the application in question.



8 Concluding Remarks 

Motivated by virtual studio applications, we have studied the technique for “si- 
multaneous calibration” for computing the 3-D position and focal length of a 
continuously moving and continuously zooming camera from an image of a pla- 
nar pattern placed behind the object. We have described a procedure for com- 
puting an optimal solution that provides an evaluation of the reliability of the 
solution. 

Then, we showed that degeneracy of the solution and statistical fluctuations
of computation can be avoided by model selection: we predict the 3-D position
and focal length of the camera in multiple ways and select the best model using
the geometric AIC and the geometric MDL. Doing numerical and real-image
experiments, we have observed that the geometric MDL tends to select a simpler
model than the geometric AIC, thereby producing a smoother and more cohesive
estimation.

Abstract. In this paper we describe an efficient method to impose the
constraints existing between the collineations which can be computed
from a sequence of views of a planar structure. These constraints are
usually not taken into account by multi-view techniques in order not to
increase the computational complexity of the algorithms. However,
imposing the constraints is very useful since it allows a reduction in the
geometric errors in the reprojected features and provides a consistent set
of collineations which can be used for several applications such as
mosaicing, reconstruction and self-calibration. In order to show the validity
of our approach, this paper focuses on self-calibration from unknown planar
structures, proposing a new method exploiting the consistent set of
collineations. Our method can deal with an arbitrary number of views and
an arbitrary number of planes and varying camera internal parameters.
However, for simplicity this paper will only discuss the case with
constant camera internal parameters. The results obtained with synthetic
and real data are very accurate and stable even when using only a few
images.



Keywords: Self-calibration, Homography. 

1 Introduction 

The particular geometry of features lying on planes is often the reason for the 
inaccuracy of many computer vision applications (structure from motion, self- 
calibration) if it is not taken explicitly into account in the algorithms. Intro- 
ducing some knowledge about the coplanarity of the features and about their 
structure (metric or topological) can improve the quality of the estimates [1, 2].
However, the only prior geometric knowledge on the features that will be used 
here is their coplanarity. Two views of a plane are related by a collineation. 
Using multiple views of a plane we obtain a set of collineations which are not 
independent. If there are multiple planes in the scene there will be a set of col- 
lineations for each plane and again some constraints between the different sets. 

D. Vernon (Ed.): ECCV 2000, LNCS 1843, pp. 610-ESI 2000. 

In order to avoid solving non-linear optimisation problems, the constraints
existing within a set of collineations and between sets have often been
neglected. However, these multi-view constraints can be used to improve the
estimation of the collineation matrices, as has been done for the case where
multiple planes (> 2) are supposed to be viewed in the images. In this paper
we analyse the constraints existing between a set of collineations induced by
a single plane in the image, but it is very easy to extend our analysis to the
case of multiple planes. Imposing the constraints is useful since it allows
the reduction of the geometric error in the reprojected features and provides
a consistent set of collineations which can be used for several applications
such as mosaicing, reconstruction and self-calibration.

In this paper we will focus on camera self-calibration. Camera self-calibration
from views of a generic scene has been widely investigated and the two main
approaches are based on the properties of the absolute conic or on some
algebraic error. Depending on the a priori information provided, the
self-calibration algorithms can be classified as follows. Algorithms that use
some knowledge of the observed scene: identifiable targets of known shape,
metric structure of planes. Algorithms that exploit particular camera motions:
translating camera or rotating camera. Algorithms that suppose some of the
camera parameters to be known: some fixed camera parameters (e.g. zero skew,
unit aspect ratio, ...), varying camera parameters. Camera self-calibration
from planar scenes with known metric structure has been investigated in several
papers. However, it is interesting to develop flexible techniques which do not
need any a priori knowledge about the camera motion or metric knowledge of the
planar scene. A method for self-calibrating a camera from views of planar
scenes without knowing their metric structure was proposed by Triggs, who
developed a self-calibration technique based on some constraints involving the
absolute quadric and the scene-plane to image-plane collineations. However, in
practice it is not possible to estimate these collineations without knowing the
metric structure of the plane. Only the collineations with respect to a
reference view (a key image) can be used to self-calibrate a camera with
constant internal parameters. As noticed by Triggs, inaccurate measurements or
poor conditioning in the key image contribute to all the collineations,
reducing the numerical accuracy or the stability of the method. The aim of this
paper is to investigate how to improve the self-calibration from planar scenes
with unknown metric structure. We will not use any key image; instead, all the
images of the sequence are treated equally, averaging the uncertainty over all
of them.

This paper is organised as follows. In Section 2 we review the relationship
existing between two views of coplanar features and some properties of the
collineation matrices. In Section 3 we generalise the two-view geometry to multiple
views, introducing the super-collineation matrix to describe a set of collineations.
Then, we describe a simple algorithm to impose the constraints existing between
the collineations of the set. Finally, we describe some constraints on the camera
internal parameters which can be used for self-calibration. In Section 4 we give
the results obtained with both synthetic and real data.
2 Two-View Geometry of a Plane



In this section we describe the relationship between two views of a planar
structure. Each camera performs a perspective projection of a point $x \in \mathbb{P}^3$ (with
homogeneous coordinates $x = [X\ Y\ Z\ 1]^T$) to an image point $p \in \mathbb{P}^2$ (with
homogeneous coordinates $p = [u\ v\ 1]^T$) measured in pixels: $p \propto K\,[R\ \ t]\,x$,
where $R$ and $t$ represent the displacement between the frame $\mathcal{F}$ attached to the
camera and an absolute coordinate frame, and $K$ is a non-singular (3 x 3)
matrix containing the intrinsic parameters of the camera:

$$K = \begin{bmatrix} f k_u & -f k_u \cot(\theta) & u_0 \\ 0 & f k_v / \sin(\theta) & v_0 \\ 0 & 0 & 1 \end{bmatrix} \qquad (1)$$

where $u_0$ and $v_0$ are the coordinates of the principal point (in pixels), $f$ is the focal
length (in metres), $k_u$ and $k_v$ are the magnifications in the $u$ and $v$ directions
respectively (in pixels/metre), and $\theta$ is the angle between these axes.
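As an illustration of equation (1) and of the projection $p \propto K\,[R\ \ t]\,x$, the following minimal sketch builds $K$ and projects one point. The function names and all numerical values are assumptions chosen for the example; they do not come from the paper.

```python
import math

def intrinsic_matrix(f, k_u, k_v, theta, u0, v0):
    """Build K of equation (1); theta is the angle between the pixel axes (radians)."""
    return [
        [f * k_u, -f * k_u / math.tan(theta), u0],
        [0.0, f * k_v / math.sin(theta), v0],
        [0.0, 0.0, 1.0],
    ]

def project(K, R, t, X):
    """Project a 3D point X (length 3) to pixel coordinates (u, v) via p ~ K [R t] x."""
    # Coordinates in the camera frame: R X + t
    c = [sum(R[i][j] * X[j] for j in range(3)) + t[i] for i in range(3)]
    # Homogeneous pixel coordinates: K c
    p = [sum(K[i][j] * c[j] for j in range(3)) for i in range(3)]
    return p[0] / p[2], p[1] / p[2]

# Hypothetical camera: f * k_u = f * k_v = 1000 pixels, orthogonal axes,
# principal point at (250, 250).
K = intrinsic_matrix(f=0.01, k_u=1e5, k_v=1e5, theta=math.pi / 2, u0=250.0, v0=250.0)
I3 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
u, v = project(K, I3, [0.0, 0.0, 0.0], [0.1, -0.2, 2.0])
```

With $\theta = \pi/2$ the skew term $-f k_u \cot(\theta)$ vanishes up to rounding, so $K$ reduces to the usual zero-skew form.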



2.1 The Collineation Matrix in Projective Space 

Let $\mathcal{F}_i$ and $\mathcal{F}_j$ be two frames attached respectively to the images $I_i$ and $I_j$. The
two views of a planar object are related by a collineation matrix in projective
space. Indeed, the image coordinates $p_{ik}$ of the point $\mathcal{P}_k$ in the image $I_i$ can be
obtained from the image coordinates $p_{jk}$ of the point $\mathcal{P}_k$ in the image $I_j$:

$$p_{ik} \propto G_{ij}\, p_{jk} \qquad (2)$$

where the collineation matrix $G_{ij}$ is a (3 x 3) matrix defined up to a scalar factor
which can be written as:

$$G_{ij} \propto K_i H_{ij} K_j^{-1} \qquad (3)$$

where $H_{ij}$ is the corresponding homography matrix in the Euclidean space.
Homography and collineation are generally used to indicate the same projective
transformation from $\mathbb{P}^n$ to $\mathbb{P}^n$ (in our case $n = 2$). In this paper we will use the
term “homography” to indicate a collineation expressed in Euclidean space.

A relationship similar to equation (2) exists between the projections $l_{ik}$ and
$l_{jk}$ in the two images $I_i$ and $I_j$ of a 3D line $\mathcal{L}_k$:

$$l_{ik} \propto G_{ij}^{-T}\, l_{jk} \qquad (4)$$

The estimation of the collineation matrix is possible both from equation (2)
and/or equation (4). However, for simplicity we will analyse only the case of
points since the same results can be applied for lines.



Multi-view Constraints between Collineations 



613 



2.2 The Homography Matrix in Euclidean Space 

The homography matrix can be written as a function of the camera displacement
and the normal to the plane:

$$H_{ij} = R_{ij} + \frac{t_{ij}\, n_j^T}{d_j} \qquad (5)$$

where $R_{ij}$ and $t_{ij}$ are respectively the rotation and the translation between the
frames $\mathcal{F}_i$ and $\mathcal{F}_j$, $n_j$ is the normal to the plane $\pi$ expressed in the frame $\mathcal{F}_j$,
and $d_j$ is the distance of the plane $\pi$ from the origin of the frame $\mathcal{F}_j$. From
equation (3), $H_{ij}$ can be estimated from $G_{ij}$ if we know the internal parameters of
the two cameras:

$$H_{ij} \propto K_i^{-1} G_{ij} K_j \qquad (6)$$

Three important properties of the homography matrix will be extended to the 
multi- view geometry in the next section: 

1. the Euclidean homography matrix is not defined up to a scale factor. If
the homography is multiplied by a scalar $\gamma$ ($\hat{H} = \gamma H$), this scalar can be
easily recovered. If $\mathrm{svd}(\hat{H}) = [\sigma_1\ \sigma_2\ \sigma_3]$ are the singular values of $\hat{H}$ in
decreasing order, $\sigma_1 \geq \sigma_2 \geq \sigma_3 > 0$, then $\gamma$ is the median singular value of
$\hat{H}$: $\gamma = \mathrm{median}(\mathrm{svd}(\hat{H})) = \sigma_2$. Indeed, the matrix $H$ has a unit singular
value and this property can be used to normalise the homography matrix.

2. from equation (5) it is easy to show that the homography matrix satisfies the
following equation $\forall k > 0$ (where $[n_i]_\times$ and $[n_j]_\times$ are the skew symmetric
matrices associated with the vectors $n_i$ and $n_j$ which represent the normal to
the plane expressed respectively in the image frames $\mathcal{F}_i$ and $\mathcal{F}_j$):

$$[n_i]_\times^k H_{ji}^T = H_{ij} [n_j]_\times^k \qquad (7)$$

This equation provides useful constraints. If $k = 1$, the matrix $[n_i]_\times H_{ji}^T =
[n_i]_\times R_{ij}$ has similar properties to the essential matrix (i.e. $E = [t]_\times R$).
Indeed, this matrix has two equal singular values and one equal to zero.
This gives two constraints per homography on the camera internal
parameters, which can be used for self-calibration. If $k = 2$,
knowing that $[n]_\times^2 = n n^T - I$, equation (7) can be written:

$$n_i n_i^T H_{ji}^T - H_{ij} n_j n_j^T = H_{ji}^T - H_{ij} \qquad (8)$$

and provides equations that will be used to compute $n_i$ and $n_j$.

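3. a very important relation can be obtained from equation (7) (with $k = 1$)
and will be used to compute $n_i$ and $n_j$:

$$[n_i]_\times = H_{ij} [n_j]_\times H_{ij}^T \qquad (9)$$

Indeed, since $\det(M)\, M^{-T} [v]_\times M^{-1} = [Mv]_\times$, then:

$$n_i = Q_{ij}\, n_j \qquad (10)$$

where:

$$Q_{ij} = \det(H_{ij})\, H_{ij}^{-T} \qquad (11)$$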
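The right-hand side of equation (7) for $k = 1$ can be checked numerically. The sketch below builds a homography from equation (5) and verifies that $H_{ij}[n_j]_\times = [n_i]_\times R_{ij}$, with $n_i = R_{ij} n_j$. The rotation, translation, normal and distance are hypothetical values chosen for the example, and the helper names are not from the paper.

```python
import math

def matmul(A, B):
    """Product of two 3x3 matrices given as nested lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)] for i in range(3)]

def skew(v):
    """Skew-symmetric matrix [v]_x such that [v]_x w = v x w."""
    return [[0.0, -v[2], v[1]], [v[2], 0.0, -v[0]], [-v[1], v[0], 0.0]]

def homography(R, t, n, d):
    """Euclidean homography H = R + t n^T / d of equation (5)."""
    return [[R[i][j] + t[i] * n[j] / d for j in range(3)] for i in range(3)]

# Hypothetical displacement: 30 degree rotation about the z axis plus a translation.
a = math.radians(30.0)
R = [[math.cos(a), -math.sin(a), 0.0], [math.sin(a), math.cos(a), 0.0], [0.0, 0.0, 1.0]]
t = [0.2, -0.1, 0.05]
n_j = [0.0, 0.6, 0.8]   # unit normal to the plane, expressed in frame j
d_j = 2.0               # distance of the plane from the origin of frame j

H = homography(R, t, n_j, d_j)
n_i = [sum(R[r][c] * n_j[c] for c in range(3)) for r in range(3)]  # n_i = R n_j

lhs = matmul(H, skew(n_j))   # H_ij [n_j]_x
rhs = matmul(skew(n_i), R)   # [n_i]_x R_ij
max_diff = max(abs(lhs[r][c] - rhs[r][c]) for r in range(3) for c in range(3))
```

The translation term drops out because $n_j^T [n_j]_\times = 0$, which is why the product behaves like an essential matrix.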
3 Multi-view Geometry of a Plane

In this section we describe the relationships between several views of a planar 
structure. We will point out that a super matrix of 2D collineations among m 
views has rank 3 and we will show how to enforce the rank property in an iterative 
procedure. The properties of the corresponding super matrix of 2D homographies 
provide the necessary constraint for the self-calibration of the camera internal 
parameters. In what follows we will describe the case when only one planar 
structure is used but the extension to more than one plane is straightforward. 

3.1 The Super-Collineation Matrix 

If $m$ images of an unknown planar structure are available, it is possible to
compute $m(m - 1)$ collineations (the $m$ collineations $G_{ii}$ are always equal to the identity
matrix). Let us define the super-collineation matrix as follows:

$$G = \begin{bmatrix} G_{11} & \cdots & G_{1m} \\ \vdots & \ddots & \vdots \\ G_{m1} & \cdots & G_{mm} \end{bmatrix} \qquad (12)$$

with dim(G) = (3m, 3m) and rank(G) = 3. The rank of $G$ cannot be less than
three since $G_{ii} = I_3\ \forall i \in \{1, 2, 3, \dots, m\}$, and cannot be more than three since
each row of the matrix can be obtained from a linear combination of three other
rows:

$$G_{ij} = G_{ik} G_{kj} \quad \forall i, j, k \in \{1, 2, 3, \dots, m\} \qquad (13)$$

This is a very strong constraint which is generally never imposed. Indeed, it
would require a complex nonlinear minimisation algorithm over all the images.
The constraints (13) can be summarised by the following equation:

$$G^2 = m\,G \qquad (14)$$

Then, matrix $G$ has 3 nonzero equal eigenvalues $\lambda_1 = \lambda_2 = \lambda_3 = m$ and $3(m - 1)$
null eigenvalues $\lambda_4 = \lambda_5 = \dots = \lambda_{3m} = 0$. If we can impose the constraint
$G^2 = m\,G$ (with $G_{ii} = I_3$, $i = 1, 2, 3, \dots, m$), then this is in fact equivalent to
imposing the constraints $G_{ij} = G_{ik} G_{kj}$.
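Equation (14) can be verified on a small consistent example. In the sketch below each block is built as $G_{ij} = A_i A_j^{-1}$, where the matrices $A_i$ are hypothetical invertible (3 x 3) matrices chosen for the example (they play the role of plane-to-image collineations with a consistent scale). Exact rational arithmetic is used so that the check is not affected by rounding.

```python
from fractions import Fraction

def matmul(A, B):
    """Matrix product for nested lists of any compatible size."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B))) for j in range(len(B[0]))]
            for i in range(len(A))]

def inverse3(M):
    """Exact inverse of a 3x3 matrix via the adjugate."""
    (a, b, c), (d, e, f), (g, h, i) = M
    det = a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)
    adj = [[e * i - f * h, c * h - b * i, b * f - c * e],
           [f * g - d * i, a * i - c * g, c * d - a * f],
           [d * h - e * g, b * g - a * h, a * e - b * d]]
    return [[Fraction(x) / det for x in row] for row in adj]

def super_collineation(A):
    """Stack the blocks G_ij = A_i A_j^{-1} into a (3m x 3m) matrix."""
    m = len(A)
    A_inv = [inverse3(Ai) for Ai in A]
    G = [[Fraction(0)] * (3 * m) for _ in range(3 * m)]
    for i in range(m):
        for j in range(m):
            block = matmul(A[i], A_inv[j])
            for r in range(3):
                for c in range(3):
                    G[3 * i + r][3 * j + c] = block[r][c]
    return G

# Hypothetical invertible matrices, one per image (m = 3).
A = [
    [[1, 0, 0], [0, 1, 0], [0, 0, 1]],
    [[2, 1, 0], [0, 1, 3], [1, 0, 1]],
    [[1, 2, 1], [0, 3, 1], [2, 0, 1]],
]
m = len(A)
G = super_collineation(A)
G_squared = matmul(G, G)
m_times_G = [[m * x for x in row] for row in G]
```

Here `G_squared` equals `m_times_G` exactly. Blocks estimated independently from noisy points, each with its own arbitrary scale, do not satisfy this equality, which is the motivation for imposing the constraint.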

Imposing the constraints In order to impose the constraint, we exploit
the properties of the super-collineation matrix. Let $p_{ij}$ be the $j$-th point
($j \in \{1, 2, 3, \dots, n\}$) of the $i$-th image ($i \in \{1, 2, 3, \dots, m\}$). The $j$-th point in all the
images can be represented by the vector of dimension (3m, 1) (which we will call
a super-point): $p_j = [p_{1j}^T\ p_{2j}^T \cdots p_{mj}^T]^T$. Generalising equation (2) we obtain:

$$\Gamma_j\, p_j = G\, p_j \qquad (15)$$

where $\Gamma_j = \mathrm{diag}(\gamma_{1j} I_3, \gamma_{2j} I_3, \dots, \gamma_{mj} I_3)$ is a diagonal matrix relative to the set
of points $j$. Then, multiplying both sides of equation (15) by $G$ we have:

$$G\,\Gamma_j p_j = G^2 p_j = m\,G\,p_j = m\,\Gamma_j p_j \qquad (16)$$
The vector $\bar{p}_j = \Gamma_j p_j$ (representing the homogeneous coordinates of the point
$j$ in all the images) is an eigenvector of $G$ corresponding to the eigenvalue $m$:

$$G\,\bar{p}_j = m\,\bar{p}_j \qquad (17)$$

As a consequence any super-point can be obtained as a linear combination of
the eigenvectors of $G$ corresponding to the eigenvalue $\lambda = m$:

$$\bar{p}_j = \alpha_1 x_1 + \alpha_2 x_2 + \alpha_3 x_3 \qquad (18)$$

The matrix $G$ can always be diagonalised and thus three linearly independent
eigenvectors always exist, i.e., $\exists X : X^{-1} G X = \mathrm{diag}(\lambda_1, \lambda_2, \dots, \lambda_{3m})$. The
columns of the matrix $X$ are in fact eigenvectors of $G$. Since $X$ is nonsingular,
the eigenvectors of $G$ are linearly independent and span the space $\mathbb{R}^{3m}$. That
means that an initial estimate $\hat{p} \in \mathbb{R}^{3m}$ of the super-point $\bar{p}$ can be written
as $\hat{p} = \alpha_1 x_1 + \alpha_2 x_2 + \alpha_3 x_3 + \dots + \alpha_{3m} x_{3m}$. The real super-point $\bar{p}$ is an
eigenvector of $G$ corresponding to the largest eigenvalue $\lambda = m$. We can thus use a
well-known algorithm (the power method) to find an eigenvector of $G$ starting from $\hat{p}$.
Let us multiply our vector by $\frac{1}{m} G$:

$$\hat{p}^{(1)} = \frac{1}{m} G \hat{p} = \frac{1}{m}\left(\alpha_1 G x_1 + \alpha_2 G x_2 + \alpha_3 G x_3 + \dots + \alpha_{3m} G x_{3m}\right) \qquad (19)$$

and then replace each $G x_k$ with its corresponding $\lambda_k x_k$. Factoring out $\lambda_1$ we
have:

$$\hat{p}^{(1)} = \frac{\lambda_1}{m}\left(\alpha_1 x_1 + \frac{\lambda_2}{\lambda_1}\alpha_2 x_2 + \frac{\lambda_3}{\lambda_1}\alpha_3 x_3 + \dots + \frac{\lambda_{3m}}{\lambda_1}\alpha_{3m} x_{3m}\right) \qquad (20)$$

In a similar way, iterating the procedure $k$ times we obtain:

$$\hat{p}^{(k)} = \left(\frac{\lambda_1}{m}\right)^{k}\left(\alpha_1 x_1 + \left(\frac{\lambda_2}{\lambda_1}\right)^{k}\alpha_2 x_2 + \left(\frac{\lambda_3}{\lambda_1}\right)^{k}\alpha_3 x_3 + \dots + \left(\frac{\lambda_{3m}}{\lambda_1}\right)^{k}\alpha_{3m} x_{3m}\right) \qquad (21)$$

This algorithm will converge to the eigenvector of $G$ corresponding to the highest
eigenvalue since all the fractions $\lambda_k/\lambda_1$ that are less than unity in magnitude
become smaller as we raise them to higher powers. In our case, if we knew exactly the
super-collineation matrix, the algorithm would converge after only one iteration
since we have $\lambda_1 = \lambda_2 = \lambda_3 = m$ and $\lambda_k = 0\ \forall\, 3 < k \leq 3m$, and the new
estimated super-point would satisfy the constraint of being an eigenvector of $G$,
which means that the noise has been reduced.
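The one-step convergence with an exact $G$ can be illustrated with a deliberately simple case. The sketch below assumes that the images differ by pure translations, so that each block $G_{ij}$ is a known image shift; this is a hypothetical setting chosen only because the exact super-collineation is then trivial to write down. A noisy super-point is multiplied once by $\frac{1}{m} G$ and the result is checked to be an eigenvector for the eigenvalue $m$.

```python
def build_super_collineation(shifts):
    """Exact G for pure image translations: block G_ij maps image j onto image i."""
    m = len(shifts)
    G = [[0.0] * (3 * m) for _ in range(3 * m)]
    for i, (xi, yi) in enumerate(shifts):
        for j, (xj, yj) in enumerate(shifts):
            block = [[1.0, 0.0, xi - xj], [0.0, 1.0, yi - yj], [0.0, 0.0, 1.0]]
            for r in range(3):
                for c in range(3):
                    G[3 * i + r][3 * j + c] = block[r][c]
    return G

def matvec(G, p):
    """Matrix-vector product for nested lists."""
    return [sum(g * x for g, x in zip(row, p)) for row in G]

def constraint_step(G, p, m):
    """One iteration p <- (1/m) G p, as in equation (19)."""
    return [x / m for x in matvec(G, p)]

shifts = [(0.0, 0.0), (10.0, 0.0), (0.0, 5.0)]   # hypothetical image offsets
m = len(shifts)
G = build_super_collineation(shifts)

# Noisy super-point: the true point is (100, 50) in image 1, (110, 50) in image 2
# and (100, 55) in image 3; each observation is perturbed by a small error.
p_noisy = [100.3, 49.8, 1.0,
           109.4, 50.1, 1.0,
           100.0, 55.4, 1.0]

p_new = constraint_step(G, p_noisy, m)

# After one step, G p_new = m p_new holds up to rounding.
residual = max(abs(a - m * b) for a, b in zip(matvec(G, p_new), p_new))
```

In this translation-only setting the step replaces each observation by the average of the predictions from all images, so the remaining error in every image is the mean of the individual errors.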

In practice, the real super-collineation matrix $G$ is unknown and we must use
an approximation $\hat{G}$ estimated from the noisy points in the images. The
algorithm used is the following. We start with a set of $n$ points $\hat{p}_j$ ($j = 1, 2, 3, \dots, n$)
and compute the super-collineation matrix $\hat{G}$ by solving independently the linear
problem of estimating each block of $\hat{G}$ from equation (2). It is not necessary that
all the points are visible in all the images. Then, we compute a new set of
super-points trying to impose the constraint. The better the estimate of $G$ we obtain,
the faster the algorithm will converge and the more accurate the results will be.
At iteration $k$ the algorithm is:

(i) estimate the super-collineation matrix

$$\hat{p}_j(k) \;\Rightarrow\; \hat{G}(k) \qquad (22)$$

(ii) compute the new super-point

$$\hat{p}_j(k + 1) = \frac{1}{m}\,\hat{G}(k)\,\hat{p}_j(k) \qquad (23)$$

This algorithm treats all the images with the same priority without using any
key image and forces the rank 3 constraint on $G$. We now show some simulation
results which demonstrate the validity of our approach (we will describe in the
following sections how to use the consistent set of collineation matrices in order to
perform the self-calibration of the camera).




[Figure 1: rate of noise reduction (vertical axis) against number of images
(horizontal axis), with curves for the theoretical G and for the estimated G at
several noise levels.]

Fig. 1. Geometric noise reduction obtained imposing the rank 3 constraint on the
super-collineation matrix.



Simulation results The algorithm was tested on a simulated planar grid of 64
points. It converges after 1 or 2 iterations imposing the constraint on the rank
of the matrix G. The error between the theoretical position of the points (which
is never used in the algorithm but only for its evaluation) and the transformed
coordinates is greatly reduced. Figure 1 shows the results obtained from random
views with a random axis of rotation and a 30 degree angle of rotation with
respect to a fixed position. The horizontal axis gives the number of
images used and the vertical axis the corresponding rate of noise reduction.
The continuous line gives an upper bound on the reduction rate that would be
possible if the super-collineation were known exactly. This reduction is practically
independent of the level of noise and all continuous lines are superposed in
Figure 1. The dash-dotted lines represent the results obtained estimating the
super-collineation matrix with different levels of noise ($0.1 < \sigma < 1000$). The
rate of noise reduction does not vary significantly with the amplitude of the noise.
However, when the level of noise increases the reduction rate decreases since the
estimation of the super-collineation matrix becomes less accurate. Finally, we
obtain only a small improvement when increasing the number of images
from 50 to 100. In the simulation results described in Section 4 we obtain very
similar results varying the camera internal parameters and the angle of rotation
between the images.



3.2 The Super-Homography Matrix 

Let us define the super-homography matrix in the Euclidean space as:

$$H = \begin{bmatrix} H_{11} & \cdots & H_{1m} \\ \vdots & \ddots & \vdots \\ H_{m1} & \cdots & H_{mm} \end{bmatrix} \qquad (24)$$

with dim(H) = (3m, 3m) and rank(H) = 3. The super-homography matrix can
be obtained from the super-collineation matrix and the camera parameters:

$$H = K^{-1} G K \qquad (25)$$

where (dim(K) = (3m, 3m) and rank(K) = 3m):

$$K = \begin{bmatrix} K_1 & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & K_m \end{bmatrix} \qquad (26)$$

is the matrix containing the internal parameters of all the cameras. It should
be noticed that if the constraint $G^2 = mG$ was imposed, then the constraint
$H^2 = mH$ is automatically imposed, which means that the following constraints
are satisfied:

$$H_{ij} = H_{ik} H_{kj} \qquad (27)$$

Unlike the super-collineation matrix, the super-homography matrix is not defined
up to a diagonal similarity. Indeed, if $\sigma_{ij}$ denotes the median singular value of the
matrix $\hat{H}_{ij}$ we can build the following matrix which contains all the coefficients of
normalisation: $D = \mathrm{diag}(\sigma_{11} I_3, \sigma_{12} I_3, \dots, \sigma_{1m} I_3)$. The super-homography matrix
is thus normalised as follows:

$$H = D \hat{H} D^{-1} \qquad (28)$$

where $\hat{H}$ denotes the unnormalised super-homography obtained from equation (25).
From this equation we can easily see that the constraint $H^2 = mH$ holds. In
the presence of noise, normalising $H$ with equation (28) will conserve the rank
constraint of the matrix since it is a similarity transformation.
3.3 Super-Homography Decomposition

After normalisation, the super-homography matrix can be decomposed as:

$$H = R + T N^T \qquad (29)$$

where:

$$R = \begin{bmatrix} R_{11} & \cdots & R_{1m} \\ \vdots & \ddots & \vdots \\ R_{m1} & \cdots & R_{mm} \end{bmatrix}, \quad
T = \begin{bmatrix} \frac{t_{11}}{d_1} & \cdots & \frac{t_{1m}}{d_m} \\ \vdots & \ddots & \vdots \\ \frac{t_{m1}}{d_1} & \cdots & \frac{t_{mm}}{d_m} \end{bmatrix}, \quad
N = \begin{bmatrix} n_1 & 0 & \cdots & 0 \\ 0 & n_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & n_m \end{bmatrix} \qquad (30)$$

with dim(R) = (3m, 3m), rank(R) = 3, dim(T) = (3m, m) and dim(N) =
(3m, m). Matrix $R$ is a symmetric matrix, $R = R^T$, and $R^2 = mR$. As a
consequence not only are the three largest eigenvalues $\lambda_1 = \lambda_2 = \lambda_3 = m$ but also
the three largest singular values are $\sigma_1 = \sigma_2 = \sigma_3 = m$. Two different methods
have been presented in the literature for decomposing the homography matrix,
computed from two views of a planar structure, following equation (5). In general,
there are two possible solutions but the ambiguity can be resolved by adding
more images. Here we present a method to decompose any set of homography
matrices. Equation (11) can be generalised as follows:



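$$Q = \begin{bmatrix} Q_{11} & \cdots & Q_{1m} \\ \vdots & \ddots & \vdots \\ Q_{m1} & \cdots & Q_{mm} \end{bmatrix} = W H^T W^{-1} \qquad (31)$$

where $W = \mathrm{diag}(I_3, \det(H_{21}) I_3, \dots, \det(H_{m1}) I_3)$, dim(Q) = (3m, 3m) and rank(Q) =
3. Matrix $Q$ has similar properties to the matrix $H$; for example, it has an
eigenvalue $\lambda = m$ of multiplicity three. The vector $n$ is an eigenvector of $Q$
corresponding to the eigenvalue $\lambda = m$:

$$Q\,n = m\,n \qquad (32)$$

where $n = [n_1^T\ n_2^T \cdots n_m^T]^T$. The vector can be written as a linear combination of
the eigenvectors, $n = x\,v_1 + y\,v_2 + z\,v_3 = V\mathbf{x}$, where $\mathbf{x} = [x\ y\ z]^T$ is a vector
containing three unknowns and $V = [v_1\ v_2\ v_3]$ is a known matrix. Writing $V_i$ for the
$i$-th (3 x 3) block of rows of $V$, so that $n_i = V_i \mathbf{x}$, and imposing
the constraint $\|n_k\| = 1$ and the constraints given by equation (8), we obtain:

$$V_i\,\mathbf{x}\mathbf{x}^T V_i^T H_{ji}^T - H_{ij} V_j\,\mathbf{x}\mathbf{x}^T V_j^T = H_{ji}^T - H_{ij} \qquad (33)$$

from which it is possible to compute the unknown matrix $\mathbf{x}\mathbf{x}^T$ and then, by singular
value decomposition, the original unknown $\mathbf{x}$. Once $\mathbf{x}$ is found, the normals
to the plane are extracted and, knowing that $RN = N O_m$ (where $O_m$ denotes the
(m x m) matrix of ones), we find:

$$T = HN - N O_m \qquad (34)$$

$$R = H(I_{3m} - NN^T) + N O_m N^T \qquad (35)$$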
3.4 Camera Self-Calibration

The super-homography can of course be used in many applications. In this
section we use the properties of the set of homography matrices to self-calibrate the
cameras. It should be noticed that we avoid the use of a bundle adjustment
technique to impose the rank 3 constraint on the super-homography (as explained
in Section 3) and thus we considerably simplify the algorithm. In this case, the
only unknowns are the camera internal parameters. Each independent homography
provides two constraints on the parameters according to equation (7).
Indeed, if $\sigma^1_{ij}$ and $\sigma^2_{ij}$ are the two non-zero singular values of $[n_i]_\times H_{ji}^T$, which
should be equal, our self-calibration method is based on the minimisation of a cost
function of these singular values (equation (36)).

A minimum of 3 independent homography matrices (4 images) is sufficient to
recover the focal length and the principal point supposing $r = 1$ and $\theta = \pi/2$,
and a minimum of 4 independent homography matrices (5 images) is sufficient
to recover all the parameters.

4 Experiments 

The self-calibration algorithm has been tested on synthetic and real images.
The results obtained with a calibration grid were compared with the standard
Faugeras-Toscani method. Our self-calibration algorithm is the following:

1. Match corresponding points in m images of a planar structure; 

2. Compute the super-collineation imposing the rank 3 constraint using the 
algorithm described in Section 3.1; 

3. Using an initial guess of the camera parameters compute the normalised 
super-homography matrix as described in Section 3.2; 

4. Decompose the super-homography matrix and find the normal to the plane 
as described in Section 3.3; 

5. Compute a new set of camera parameters which minimise the cost function 
given in Section 3.4 and go to step 3. 

4.1 Simulations of a Planar Grid 

The planar grid used for the simulations in Section 3 was used to test the
self-calibration algorithm. The experimental setup is as close as possible to the one
proposed by Triggs. The cameras roughly fixate a point on the plane from
randomly generated orientations varying ±30° in each of the three axes. The
nominal camera calibration is $f = 1000$, $r = 1$, $\theta = 90°$, $u = 250$ and $v = 250$.
The plane contains 64 points projected into a 500 x 500 image. The camera
calibration varies randomly about the nominal values with $\sigma_f = ±30\%$, $\sigma_r = ±10\%$,
$\sigma_\theta = 0.5°$ and $\sigma_u = \sigma_v = ±75$ pixels ($\sigma_f$ and $\sigma_r$ are standard deviations of
log-normal distributions while $\sigma_\theta$, $\sigma_u$ and $\sigma_v$ are those of normal ones).



E. Malis and R. Cipolla

[Figure 2 panels: (a) E(e_f), (b) E(e_θ), (c) E(e_r), (d) E(e_u), (e) E(e_v),
(f) failure rate, (g) E(e_f) with r = 1, θ = 90°, (h) E(e_u) with r = 1, θ = 90°,
(i) E(e_v) with r = 1, θ = 90°.]


Fig. 2. Simulation results of camera self-calibration using a planar grid. The graphs
(a), (b), (c), (d) and (e) show the mean of the errors on the camera internal parameters
(respectively f, θ, r, u, v) obtained with the self-calibration algorithm. Graph (f) shows
the failure rate of the algorithm. Finally, the graphs (g), (h) and (i) show the mean of
the errors on the camera internal parameters (respectively f, u, v) obtained supposing
θ = 90° and r = 1.

In Figure 2 we give the results obtained using 6 (lines marked with a square),
8 (lines marked with a triangle) and 10 (lines marked with a circle) images and
supposing all the camera parameters unknown. The figure represents the mean
error computed on 100 trials with different parameters and different camera
positions for each level of noise (the standard deviation of the Gaussian noise
added to the coordinates of the points is increased from 0 to 5). The error on
the principal point is given as a percentage of the focal length. The errors on
the camera parameters increase with the noise and decrease with the number of
images used for self-calibration. Using six images the results are still satisfactory
even if the failure rate of the method increases rapidly (a failure of the method
occurs when the obtained focal length is less than 100 pixels). This can be
explained since it is known that with few images there is a risk of degenerate
configurations. The results obtained compare very favourably with the results
obtained by Triggs, especially considering the failure rate of the method.
For example, using 10 images we obtain a mean error more than 50% smaller
than the error obtained by Triggs. Finally, in Figure 2(g), (h) and (i) we give
the results of our method when r and θ are fixed to their nominal values. The
error on the focal length is practically the same. On the other hand, the error
on the location of the principal point is reduced since the non-linear search is
now done in a three-dimensional space, reducing the risk of local minima.



4.2 Self-Calibration from Images of a Grid and Comparison with 
the Standard Faugeras-Toscani Calibration Method 

A sequence (26 images of dimension (640 x 480)) of a calibration grid was taken
using a Fuji MX700 camera with a 7mm lens. Figure 3 shows three images of
the sequence. The corners of the black squares are used to compute the
super-collineation matrix in order to self-calibrate the camera with our method.




Fig. 3. Three images of the sequence taken with a digital camera. The calibration grid
provides the “ground truth” with which to compare our method with the standard
Faugeras-Toscani method. The main advantage of using our method is that we do not
need any knowledge of the 3D structure of the grid to calibrate the camera. On the
other hand, at least five images are needed.

Table 1 gives the results for the following experiment:

- non-planar calibration: the mean and the standard deviation on 26 images
of the grid calibrated with the standard Faugeras-Toscani method initialised
with the DLT linear method. In this case we use both planes to calibrate
the camera;

- planar self-calibration: the mean and the standard deviation on 50 tests using
m images (m = 6, 8, 10) randomly chosen among the 26 images of the grid.
The same tests are repeated using the right plane alone, the left plane alone
and then again with r and θ fixed to nominal values.
The results are very good and agree with the simulations. They are more accurate
than the results obtained by Triggs with a sequence of images of a calibration
grid. In our case, the angle of rotation between the images of the sequence can be
greater than 60°, which in general improves the results. However,
this is not always true since the planes can be very close to the optical center of
the camera (see Figure 3) and in this case the estimation of the collineations is
not accurate. The calibration obtained using the right plane is again very similar
to the calibration obtained using the left plane. As expected, the accuracy
decreases as we decrease the number of images, but the worst result (obtained
using only 6 images of the grid) is only an error of 2% on the focal length.



calibration method     f          r                 θ              u          v
DLT linear             685 ± 3    1.0005 ± 0.0033   90.00 ± 0.14   322 ± 5    229 ± 4
Faugeras-Toscani       685 ± 3    1.0003 ± 0.0022   90.00 ± 0.16   322 ± 5    229 ± 4
right plane (10 im)    680 ± 8    0.9976 ± 0.0088   89.23 ± 0.59   318 ± 8    230 ± 8
left plane (10 im)     680 ± 6    0.9943 ± 0.0058   89.89 ± 0.30   320 ± 7    232 ± 4
right plane (8 im)     681 ± 12   0.9950 ± 0.0105   89.21 ± 0.80   315 ± 11   232 ± 9
left plane (8 im)      678 ± 12   0.9969 ± 0.0075   90.06 ± 0.24   327 ± 14   233 ± 3
right plane (6 im)     686 ± 13   0.9891 ± 0.0126   89.63 ± 0.69   312 ± 18   231 ± 11
left plane (6 im)      685 ± 10   0.9886 ± 0.0147   89.80 ± 0.59   339 ± 18   232 ± 7
Faugeras-Toscani       685 ± 3    1 ± 0             90 ± 0         322 ± 6    229 ± 4
right plane (10 im)    679 ± 6    1 ± 0             90 ± 0         318 ± 5    224 ± 8
left plane (10 im)     675 ± 6    1 ± 0             90 ± 0         325 ± 4    232 ± 4
right plane (8 im)     687 ± 6    1 ± 0             90 ± 0         323 ± 3    231 ± 6
left plane (8 im)      676 ± 4    1 ± 0             90 ± 0         343 ± 27   231 ± 12
right plane (6 im)     676 ± 8    1 ± 0             90 ± 0         314 ± 9    227 ± 5
left plane (6 im)      677 ± 18   1 ± 0             90 ± 0         327 ± 34   230 ± 31

Table 1. Results using digital images of the grid (statistics on 50 tests)



After the camera has been calibrated, the 3D reconstruction of the planes
was carried out. If $n_{rj}$ and $n_{lj}$ are respectively the normals to the right and left
planes in the frame $\mathcal{F}_j$ attached to the image $I_j$, the angle between them is:

$$\phi_j = \cos^{-1}(n_{rj}^T n_{lj})$$

This angle should be the same for all the images $j = 1, 2, 3, \dots, m$. In order to
verify the quality of the reconstruction results we can compute the mean $\bar{\phi}$ and
the standard deviation $\sigma$ over all the images. For example, the results obtained
with a sequence of $m = 10$ images were:

$$\bar{\phi} = \frac{1}{m}\sum_{j=1}^{m} \phi_j = 89.84, \qquad \sigma = \sqrt{\frac{1}{m}\sum_{j=1}^{m} (\phi_j - \bar{\phi})^2} = 0.21$$
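The consistency check on the angle between the two reconstructed planes can be sketched as follows. The function names are assumptions made for the example, and the standard deviation is normalised by the number of images, which is also an assumption since the exact normalisation is not recoverable from the text.

```python
import math

def angle_deg(n1, n2):
    """Angle in degrees between two normals, cos^-1 of their normalised dot product."""
    dot = sum(a * b for a, b in zip(n1, n2))
    norm = math.sqrt(sum(a * a for a in n1)) * math.sqrt(sum(b * b for b in n2))
    # Clamp to [-1, 1] to guard against rounding before calling acos.
    c = max(-1.0, min(1.0, dot / norm))
    return math.degrees(math.acos(c))

def mean_and_std(values):
    """Mean and standard deviation (normalised by the number of samples)."""
    m = len(values)
    mean = sum(values) / m
    var = sum((v - mean) ** 2 for v in values) / m
    return mean, math.sqrt(var)

# Hypothetical pairs (right-plane normal, left-plane normal), one pair per image.
normal_pairs = [
    ((1.0, 0.0, 0.0), (0.0, 0.0, 1.0)),
    ((1.0, 0.0, 0.02), (0.0, 0.0, 1.0)),
    ((1.0, 0.0, -0.02), (0.0, 0.0, 1.0)),
]
angles = [angle_deg(n_r, n_l) for n_r, n_l in normal_pairs]
mean_angle, std_angle = mean_and_std(angles)
```

A small standard deviation indicates that the normals recovered in the different images describe the same pair of planes.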
4.3 Self-Calibration from Images of a Facade

In this experiment, in order to test our algorithm in very extreme conditions,
only four images of a facade (see Figure 4) were taken with the same digital
camera. With such images the localisation of the corners was not accurate, and
with only four images we can only calibrate the focal length and the principal
point (thus we fixed r = 1 and θ = 90°). The results obtained using our
self-calibration with 56 points (the corners of the windows on the facade) are f = 678
(within 1% of the mean focal length obtained with the Faugeras-Toscani method and
the calibration grid), u = 355 and v = 216. The results are very good even using
images of a roughly planar structure.




Fig. 4. Four images of a facade. The corners of the windows (marked with a white 
cross) belong roughly to a plane. They are used to compute the super-collineation 
matrix from which it is possible to self-calibrate the camera. 



5 Conclusion 

In this paper we presented an efficient technique to impose the constraints
existing within a set of collineation matrices computed from multiple views of a
planar structure. The obtained set of collineations can be used for several
applications such as mosaicing, reconstruction and self-calibration from planes. In this
paper we focused on self-calibration, proposing a new method which does not
need any a priori knowledge of the metric structure of the plane. The method
was tested with both synthetic and real images and the obtained results are very
good. However, the method could be improved by imposing further constraints
in order to obtain not only a consistent set of collineation matrices but also a
consistent set of homography matrices. The method could also be improved using
a probabilistic model for the noise.
Abstract. This paper describes a method for autocalibrating a stereo rig.
A planar object performing general and unknown motions is observed by
the stereo rig and, based on point correspondences only, the
autocalibration of the stereo rig is computed. A stratified approach is used and the
autocalibration is computed by estimating first the epipolar geometry of
the rig, then the plane at infinity $\Pi_\infty$ (affine calibration) and finally the
absolute conic $\Omega_\infty$ (Euclidean calibration). We show that the affine and
Euclidean calibrations involve quadratic constraints and we describe an
algorithm to solve them based on a conic intersection technique.
Experiments with both synthetic and real data are used to evaluate the
performance of the method.



1 Introduction 



Autocalibration consists of retrieving the metric information of the cameras - 
their internal parameters and relative position and orientation - from images, 
without using special calibration objects. Additional constraints can also be 
introduced such as knowledge of some of the internal parameters of the two 
cameras (aspect ratio, image skew, ...). 

Planar autocalibration has several advantages. Planar scenes are very easy to 
process, enable very reliable point matching by fitting inter-image homographies, 
and very accurate estimation of the homographies. It will be seen that only the 
homographies are required for the autocalibration. 



Many approaches for autocalibration have been developed for monocular and
binocular sensors in recent years. Faugeras, Luong and Maybank proposed
solving the Kruppa equations from point correspondences in 3 images. However,
this requires non-linear solution methods. An alternative is to first recover affine
structure and then solve for the camera calibration from this. Such a “stratified”
approach can be applied to a single camera motion or to a stereo
rig in motion and requires no knowledge of the observed scene. The
stratified approach applied to the autocalibration of a stereo rig involves the
computation of projective transformations of 3-D space, that is, the projective
transformations that map two different projective reconstructions of the same
3-D rigid scene onto each other. Unfortunately, these projective motions cannot be
estimated when the 3-D scene is planar, so those autocalibration approaches cannot be
used.

Some approaches for calibration and autocalibration from planar
scenes have also been developed. In one such approach, the author uses the constraint that
the projections of the circular points of a 3-D plane must lie on the image of the
absolute conic. The proposed criterion is non-linear and the associated
optimization process must be bootstrapped. Unfortunately no general method is given
to obtain this bootstrapping.

We show here that, using a stereo rig, the stratified paradigm is very well
adapted for autocalibration from planar scenes, and we extend an idea developed
previously. We prove the following results:

(1) Affine calibration can be uniquely estimated from 3 views of a plane.

(2) Euclidean calibration can be uniquely estimated from 3 views of a plane if
at least one of the cameras of the rig has zero image skew and known aspect
ratio. Otherwise 4 views are required.

2 Preliminaries 

2.1 Camera Model 

A pinhole camera projects a point $M$ from the 3-D projective space onto a
point $m$ of the 2-D projective plane. This projection can be written as a 3 x 4
homogeneous matrix $P$ of rank equal to 3:

    $m \simeq P M$

where $\simeq$ is the equality up to a scale factor. If we restrict the 3-D
projective space to the Euclidean space, then it is well known that $P$ can be
written as:

    $P = (KR \;\; Kt)$

$R$ and $t$ are the rotation and translation that link the camera frame to the
3-D Euclidean one. The most general form for the matrix of internal parameters
$K$ is:

    $K = \begin{pmatrix} \alpha & r & u_0 \\ 0 & a\alpha & v_0 \\ 0 & 0 & 1 \end{pmatrix}$

where $\alpha$ is the horizontal scale factor, $a$ is the ratio between the
vertical and horizontal scale factors, $r$ is the image skew and $u_0$ and $v_0$
are the image coordinates of the principal point.

When the aspect ratio $a$ is known and the image skew $r$ is zero (i.e. the
image axes are orthogonal), the matrix of internal parameters depends only on
3 parameters and becomes:

    $K = \begin{pmatrix} \alpha & 0 & u_0 \\ 0 & a\alpha & v_0 \\ 0 & 0 & 1 \end{pmatrix}$    (1)
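As a rough illustration of this camera model, the following Python sketch
assembles K from its five internal parameters and projects a 3-D point with
P = (KR Kt). The function names and the numeric values are assumptions chosen
for the example; they do not come from the paper.

```python
import numpy as np

def intrinsics(alpha, a, r, u0, v0):
    """Matrix of internal parameters K.

    alpha: horizontal scale factor, a: aspect ratio, r: image skew,
    (u0, v0): principal point.
    """
    return np.array([[alpha, r, u0],
                     [0.0, a * alpha, v0],
                     [0.0, 0.0, 1.0]])

def projection_matrix(K, R, t):
    """Euclidean projection matrix P = (KR Kt), a 3 x 4 matrix."""
    return K @ np.column_stack([R, t])

def project(P, M):
    """Project the 3-D point M (3-vector) and return pixel coordinates (u, v)."""
    m = P @ np.append(M, 1.0)      # homogeneous image point, defined up to scale
    return m[:2] / m[2]

# Illustrative values only: zero skew, unit aspect ratio.
K = intrinsics(alpha=1200.0, a=1.0, r=0.0, u0=256.0, v0=256.0)
P = projection_matrix(K, np.eye(3), np.zeros(3))
uv = project(P, np.array([1.0, 0.0, 5.0]))
on_axis = project(P, np.array([0.0, 0.0, 2.0]))
```

A point on the optical axis lands on the principal point, which is a quick
sanity check on the parameter layout of K.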



2.2 Stratified Calibration 

Autocalibration consists of recovering the metric information of the stereo rig.
This information can be obtained through the recovery of the internal parameters
and relative orientation and position of both cameras.

However, once the epipolar geometry of the stereo rig has been estimated
and a projective basis has been defined, the metric information of the rig is fully
encapsulated by the equation of the plane at infinity $\Pi_\infty$ and the
equation of the absolute conic $\Omega_\infty$.

2.3 Notation 

In this paper we assume that the cameras of the stereo rig have constant param- 
eters under the motion, and that the rig acquired a sequence of n image pairs of 
a moving planar object. 

We denote by $\Pi_1,\ldots,\Pi_k,\ldots,\Pi_n$ the geometric planes associated
with the different positions of the planar object.

$H_{ij}$ (resp. $H'_{ij}$) denote the homographies between the left (resp. right)
image of the stereo rig in position $i$ and the left (resp. right) image of the
stereo rig in position $j$. These 3x3 inter-image homographies can be computed
from point correspondences.

We also denote by $T_{ij}$ the geometric Euclidean transformation that maps the
points of $\Pi_i$ onto the points of $\Pi_j$. That is, if $M_i$ is a 3-D point of
the object at position $i$ and $M_j$ the same point at position $j$, then these
two points are related by $M_j = T_{ij}(M_i)$.

$A^\top$ denotes the transpose of the matrix $A$. $[\cdot]_\times$ denotes the
matrix generating the cross product: $[x]_\times y = x \wedge y$.
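For readers who prefer code to notation, here is a minimal Python sketch of the
cross-product matrix. The helper name `cross_matrix` and the test vectors are
illustrative assumptions.

```python
import numpy as np

def cross_matrix(x):
    """Return the 3 x 3 antisymmetric matrix [x]_x with [x]_x y = x ^ y."""
    return np.array([[0.0, -x[2], x[1]],
                     [x[2], 0.0, -x[0]],
                     [-x[1], x[0], 0.0]])

x = np.array([1.0, 2.0, 3.0])
y = np.array([-2.0, 0.5, 4.0])
lhs = cross_matrix(x) @ y      # matrix form of the cross product
rhs = np.cross(x, y)           # reference value from NumPy
```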



2.4 Organization of the Paper 



The remainder of the paper is organized as follows. In Section 3, we explain
how the epipolar geometry can easily be estimated from a sequence of image
pairs of a planar object. In Section 4, the affine autocalibration is described and
we show how the equation of the plane at infinity $\Pi_\infty$ can be estimated.
The Euclidean autocalibration (estimation of the absolute conic $\Omega_\infty$)
is performed in Section 5. Section 6 shows some experiments with synthetic and
real data in order to demonstrate the stability of the approach. Finally a brief
discussion is given in Section 7.



3 Projective Calibration 

The projective calibration consists of estimating the epipolar geometry of the 
stereo rig. The epipolar geometry is assumed to be constant and can therefore 
be computed from many image pairs. 

It is well known that the epipolar geometry cannot be estimated from a single 
image pair of a 3-D planar scene. However when the planar scene performs mo- 
tions, all the image pairs (each corresponding to a different position of the planar 
scene) gathered by the stereo rig can be used and this makes the computation 
of the epipolar geometry possible. 
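The rank argument behind this statement can be checked numerically. The sketch
below is only an illustration under assumed values (a made-up rig in normalized
image coordinates, random object points): it stacks the linear epipolar
constraints and shows that one position of the plane leaves the system with
rank 6, whereas two general positions bring it to rank 8, which fixes F up to
scale.

```python
import numpy as np

rng = np.random.default_rng(0)

def rot_x(a):
    c, s = np.cos(a), np.sin(a)
    return np.array([[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]])

def rot_y(a):
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])

def project(P, X):
    """Project N x 3 points with a 3 x 4 matrix; rows are homogeneous image points."""
    return np.column_stack([X, np.ones(len(X))]) @ P.T

def epipolar_system(left, right):
    """One row per match, so that row @ F.ravel() equals right^T F left."""
    return np.array([np.kron(xr, xl) for xl, xr in zip(left, right)])

def numerical_rank(A, rel_tol=1e-9):
    s = np.linalg.svd(A, compute_uv=False)
    return int(np.sum(s > rel_tol * s[0]))

# Hypothetical rig in normalized image coordinates (K = I) to keep the system well scaled.
P_left = np.hstack([np.eye(3), np.zeros((3, 1))])
P_right = np.hstack([rot_y(0.1), np.array([[-0.5], [0.0], [0.05]])])

# A planar object (z = 0 in its own frame) observed in two non-critical positions.
obj = np.column_stack([rng.uniform(-1.0, 1.0, (20, 2)), np.zeros(20)])
positions = [(rot_x(0.2), np.array([0.0, 0.0, 5.0])),
             (rot_y(-0.5), np.array([0.3, 0.0, 6.0]))]

systems = []
for R, t in positions:
    X = obj @ R.T + t                       # object points in the rig frame
    systems.append(epipolar_system(project(P_left, X), project(P_right, X)))

rank_one_position = numerical_rank(systems[0])            # one image pair of the plane
rank_two_positions = numerical_rank(np.vstack(systems))   # both image pairs together
```

With real, noisy matches the rank would be judged from the singular value
spectrum rather than a fixed threshold, and pixel coordinates would normally be
normalized first.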

The motions of the plane must be chosen so that they do not correspond
to critical motions. These are motions which are not sufficient to enable
the epipolar geometry to be computed uniquely. In this case they are
translations parallel to the plane of the scene, rotations about an axis
orthogonal to the plane of the scene, and combinations of the two. The plane is
effectively fixed (as a set, not pointwise) relative to the rig under these
motions.

The fundamental matrix $F$ associated with the stereo rig is computed from all
the left-to-right point correspondences from all the image pairs using a standard
technique. The projection matrices $P$ and $P'$ associated with the left and
right cameras respectively can then be derived. Without loss of generality
these two 3x4 matrices can be written as:

    $P \simeq (I \;\; 0)$
    $P' \simeq (\bar{P}' \;\; p')$    (2)

where $I$ is the 3x3 identity matrix, $\bar{P}'$ is a 3 x 3 matrix and $p'$ a
3-vector.

Using point correspondences it is therefore possible to obtain a projective
reconstruction of the points of the planes. It is also possible to estimate the
projective coordinates $\pi_1,\ldots,\pi_k,\ldots,\pi_n$ of the planes
$\Pi_1,\ldots,\Pi_k,\ldots,\Pi_n$ associated with the different positions of the
planar object. In the following, $\pi_k$ is the 4-vector:

    $\pi_k = (\bar\pi_k^\top \;\; \alpha_k)^\top$    (3)

where $\bar\pi_k$ is a 3-vector and $\alpha_k$ a real number.

Fig. 1. The geometry of lines and planes involved in the affine autocalibration. The
image line $l_i$ is the vanishing line of the plane $\Pi_i$, which is the image of $L_i$.



4 Affine Calibration

This section describes the affine autocalibration, which consists of estimating, in
the projective basis determined previously (2), the coordinates $\pi_\infty$ of
the plane at infinity $\Pi_\infty$. For this purpose we use here the vanishing line
of the observed plane in each left view, and show how quadratic constraints on
the coordinates of this vanishing line can be derived.

We will use the fact that $\Pi_\infty$ is a particular plane: it is the only plane of
projective space that remains globally invariant under any affine transformation,
i.e. under the action of any affine transformation, any point lying on $\Pi_\infty$
has its image lying on $\Pi_\infty$ as well.

Let $L_1,\ldots,L_k,\ldots,L_n$ be the 3-D lines corresponding to the intersections
of $\Pi_\infty$ with $\Pi_1,\ldots,\Pi_k,\ldots,\Pi_n$ respectively (see Figure 1).
We use the following result:

Proposition 1. Consider any two lines $L_i$ and $L_j$ among $L_1,\ldots,L_n$.
$T_{ij}$ being the Euclidean transformation that maps $\Pi_i$ onto $\Pi_j$ as
defined in Section 2.3, we have:

    $L_j = T_{ij}(L_i)$.

Proof: The intersection of two planes is preserved by a Euclidean transformation
(or indeed a projective transformation). However, a Euclidean transformation
has the additional property that $\Pi_\infty$ is fixed (as a set, not pointwise).
Therefore, $L_i$ (on $\Pi_\infty$) is mapped to $L_j$ (on $\Pi_\infty$). In our
notation this is written:

    $T_{ij}(L_i) = T_{ij}(\Pi_\infty \cap \Pi_i)$
    $\phantom{T_{ij}(L_i)} = T_{ij}(\Pi_\infty) \cap T_{ij}(\Pi_i)$
    $\phantom{T_{ij}(L_i)} = \Pi_\infty \cap \Pi_j$
    $\phantom{T_{ij}(L_i)} = L_j$

□
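The key property used in the proof, that a Euclidean transformation fixes the
plane at infinity as a set but not pointwise, can be illustrated in a Euclidean
frame where that plane has coordinates (0, 0, 0, 1). The sketch below uses
assumed numeric values and hypothetical helper names.

```python
import numpy as np

def euclidean(R, t):
    """4 x 4 homogeneous matrix of the rigid motion X -> R X + t."""
    T = np.eye(4)
    T[:3, :3], T[:3, 3] = R, t
    return T

def transform_plane(T, plane):
    """Planes transform with the inverse transpose: if X' = T X then pi' = T^-T pi."""
    return np.linalg.inv(T).T @ plane

c, s = np.cos(0.7), np.sin(0.7)
R = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])   # illustrative rotation
T = euclidean(R, np.array([0.3, -1.2, 2.0]))

pi_inf = np.array([0.0, 0.0, 0.0, 1.0])     # plane at infinity in a Euclidean frame
pi_obj = np.array([0.0, 0.0, 1.0, -4.0])    # the finite plane z = 4

moved_inf = transform_plane(T, pi_inf)      # unchanged: the plane is fixed as a set
moved_obj = transform_plane(T, pi_obj)      # becomes the plane z = 6
direction = T @ np.array([1.0, 0.0, 0.0, 0.0])   # a point at infinity after the motion
```

The point at infinity stays at infinity (last coordinate zero) but moves within
the plane, which is what "not pointwise" means here.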



This proves that, for all $k$, $1 \le k \le n$, $L_k$ is the same line of the planar
object in the different positions of the object, namely the line at infinity on the
scene plane. An important feature of the lines $L_1,\ldots,L_k,\ldots,L_n$ is that
they are all contained in the plane $\Pi_\infty$ and therefore are coplanar. This
provides a constraint that will be used to solve for $\pi_\infty$. In fact we actually
solve for the vanishing line $l_k$ of each plane $\Pi_k$ and parameterize the
solution by $l_1$.

Let $l_k$ be the vanishing line of $\Pi_k$, which is the image of $L_k$ in the left
camera (see Figure 1). Let $\Phi_k$ be the 3-D plane going through $L_k$ and the
optical centre $C$ of the left camera. The plane $\Phi_k$ also intersects the left
image plane at $l_k$ and it can easily be shown that, in the projective basis
defined in Section 3, the equation of $\Phi_k$ is $\phi_k = P^\top l_k$. With
$P = (I \;\; 0)$ we have $\phi_k = (l_k^\top \;\; 0)^\top$.

$L_k$ can be regarded as the intersection of $\Pi_k$ and $\Phi_k$. $\Pi_k$ and
$\Phi_k$ define a pencil of planes that contains $L_k$, and $\Pi_\infty$ is in this
pencil. $\Pi_\infty$ is therefore common to all pencils $(\Pi_k, \Phi_k)$. In other
words, there exist some reals $\lambda_1,\lambda_2,\ldots,\lambda_n$ and
$\mu_1,\mu_2,\ldots,\mu_n$ such that for all $k$:

    $\pi_\infty = \lambda_k \pi_k + \mu_k \phi_k$    (4)

Combining equation (4) for two pencils of planes $(\Pi_i, \Phi_i)$ and
$(\Pi_j, \Phi_j)$ we obtain the constraint corresponding to the coplanarity on
$\Pi_\infty$ of two lines $L_i$ and $L_j$:

    $\lambda_i \pi_i + \mu_i \phi_i = \lambda_j \pi_j + \mu_j \phi_j$    (5)



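Equation (5) means that $\pi_i$, $\phi_i$, $\pi_j$ and $\phi_j$ are linearly
dependent and therefore is equivalent to $\det(\pi_i, \pi_j, \phi_i, \phi_j) = 0$.
Using (3) for $\pi_i$ and $\pi_j$, the condition for two lines $L_i$ and $L_j$
being coplanar becomes:

    $\begin{vmatrix} \bar\pi_i & \bar\pi_j & l_i & l_j \\ \alpha_i & \alpha_j & 0 & 0 \end{vmatrix} = 0$    (6)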



The lines $l_1,\ldots,l_k,\ldots,l_n$ represent the corresponding vanishing lines
of the plane in the different images. Since all $l_k$ are images of the same line
at infinity of the planar object, they are related by the inter-image homographies:

    $l_k = H_{1k}^{-\top} l_1$    (7)

We can therefore express all the lines $l_2,\ldots,l_n$ with respect to $l_1$.
Expanding the determinant we obtain the following quadratic equation:

    $l_1^\top C_{ij} l_1 = 0$    (8)

where $C_{ij}$ is a 3x3 symmetric matrix such that
$C_{ij} = (A_{ij} + A_{ij}^\top)/2$ and $A_{ij}$ is a 3x3 matrix defined by
$A_{ij} = H_{1i}^{-1} [\alpha_j \bar\pi_i - \alpha_i \bar\pi_j]_\times H_{1j}^{-\top}$.

The coplanarity of $L_i$ and $L_j$ therefore defines a quadratic constraint on
$l_1$. Once $l_1$ is estimated, the lines $l_2,\ldots,l_n$ are estimated from (7),
and the equations of the planes $\Phi_k$ as well.

We will see that only the lines $l_1,\ldots,l_n$ are required for Euclidean
autocalibration. However $\pi_\infty$ can also be estimated. $\pi_\infty$ is
computed as the common plane to all pencils of planes $(\Pi_k, \Phi_k)$. In
practice, $\pi_\infty$ is computed by solving the linear system defined by
equations (4) where the unknowns are $\pi_\infty$ and the reals
$\lambda_1,\ldots,\lambda_n$ and $\mu_1,\ldots,\mu_n$. For $n$ positions, this
linear system has $2n + 4$ unknowns ($n$ $\lambda$'s, $n$ $\mu$'s and 4 for
$\pi_\infty$) and $4n$ equations, and these can be solved using an SVD approach.
To conclude:

- with 2 views of the planar object, we obtain a single constraint $C_{12}$ and
there is a one-parameter family of solutions for $l_1$ (all the lines of the conic
$C_{12}$). Therefore there is a one-parameter family of solutions for $\pi_\infty$;

- with 3 views of the planar object, we obtain 3 independent constraints $C_{12}$,
$C_{13}$ and $C_{23}$, and $l_1$ corresponds to the common intersection of these
conics. The solution of the equations (8) can be found in Annex A. $\pi_\infty$
is thus determined uniquely.
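The two computations of this section can be sketched on synthetic data. The
Python code below is a hedged illustration, not the authors' implementation:
the plane coordinates and vanishing lines are generated directly from assumed
poses instead of being reconstructed from stereo matches, the projective basis
(the vector `v` and scalar `s`) is arbitrary, and all numeric values are made
up. It checks the quadratic constraint (8) and then recovers the plane at
infinity from the linear system (4) by SVD, assuming the vanishing lines are
already known.

```python
import numpy as np

def skew(w):
    return np.array([[0.0, -w[2], w[1]], [w[2], 0.0, -w[0]], [-w[1], w[0], 0.0]])

def rot(axis, angle):
    """Rotation about an axis (Rodrigues formula)."""
    a = np.asarray(axis, float) / np.linalg.norm(axis)
    S = skew(a)
    return np.eye(3) + np.sin(angle) * S + (1.0 - np.cos(angle)) * (S @ S)

# Hypothetical left-camera intrinsics and object poses (illustrative values only).
K = np.array([[1200.0, 0.0, 256.0], [0.0, 1200.0, 256.0], [0.0, 0.0, 1.0]])
poses = [(rot([1, 0, 0], 0.3), np.array([0.1, 0.0, 4.0])),
         (rot([0, 1, 0], -0.4), np.array([-0.2, 0.1, 5.0])),
         (rot([1, 1, 0], 0.5), np.array([0.0, -0.3, 4.5]))]

# Arbitrary projective basis with P = (I 0); there the true plane at infinity is (v, s).
v, s = np.array([0.2, -0.1, 0.3]), 1.5
pi_inf_true = np.append(v, s)
K_inv_T = np.linalg.inv(K).T

G, G_inv, planes, lines = [], [], [], []
for R, t in poses:
    n = R[:, 2]                       # plane normal in the camera frame
    d = -n @ t                        # plane equation: n.X + d = 0
    Gk = K @ np.column_stack([R[:, 0], R[:, 1], t])    # object plane -> left image
    G.append(Gk)
    G_inv.append(np.linalg.inv(Gk))
    planes.append(np.append(K_inv_T @ n + d * v, d * s))        # pi_k in the basis
    lines.append(G_inv[-1].T @ np.array([0.0, 0.0, 1.0]))       # vanishing line l_k

H1_inv = [G[0] @ Gi for Gi in G_inv]    # H_1k^-1, since H_1k = G_k G_1^-1

def conic(i, j):
    """C_ij = (A_ij + A_ij^T)/2 with A_ij = H_1i^-1 [a_j pi_i - a_i pi_j]_x H_1j^-T."""
    w = planes[j][3] * planes[i][:3] - planes[i][3] * planes[j][:3]
    A = H1_inv[i] @ skew(w) @ H1_inv[j].T
    return 0.5 * (A + A.T)

l1 = lines[0]
residuals = [abs(l1 @ conic(i, j) @ l1) / (np.linalg.norm(conic(i, j)) * (l1 @ l1))
             for i, j in [(0, 1), (0, 2), (1, 2)]]

# Linear system (4): pi_inf - lambda_k pi_k - mu_k phi_k = 0, with phi_k = (l_k, 0).
n_pos = len(poses)
M = np.zeros((4 * n_pos, 4 + 2 * n_pos))
for k in range(n_pos):
    M[4 * k:4 * k + 4, :4] = np.eye(4)
    M[4 * k:4 * k + 4, 4 + k] = -planes[k]
    M[4 * k:4 * k + 4, 4 + n_pos + k] = -np.append(lines[k], 0.0)
pi_inf = np.linalg.svd(M)[2][-1, :4]     # null vector, defined up to scale

cosine = abs(pi_inf @ pi_inf_true) / (np.linalg.norm(pi_inf) * np.linalg.norm(pi_inf_true))
```

The sketch works in raw pixel coordinates for brevity, which makes the pencils
of planes nearly coincident; with noisy data a normalization of the image
coordinates would be advisable before solving.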



5 Euclidean Calibration 

Let $\Omega_\infty$ be the absolute conic and $\omega_\infty$ and $\omega'_\infty$
its projections onto the left and right cameras respectively. A fundamental
property of $\Omega_\infty$, $\omega_\infty$ and $\omega'_\infty$ is that
they are all invariant to Euclidean transformations (provided that the internal
parameters of the cameras are constant). Euclidean autocalibration consists of
estimating the coordinates of $\Omega_\infty$. It is also equivalent, given
$\Pi_\infty$, to estimating the equation of one of the projections of
$\Omega_\infty$. We can choose, for instance, to estimate its left projection
$\omega_\infty$ whose expression is $\omega_\infty = (KK^\top)^{-1}$ where $K$
is the matrix of internal parameters of the left camera.

Consider the (complex) circular points $\mathcal{I}_k$ and $\bar{\mathcal{I}}_k$
of the plane $\Pi_k$. By definition $\mathcal{I}_k$ and $\bar{\mathcal{I}}_k$ are
the intersections of $\Pi_k$ with $\Omega_\infty$ and therefore are also
the intersections of $L_k$ with $\Omega_\infty$. Let $I_k$ and $\bar{I}_k$ be the
projections of $\mathcal{I}_k$ and $\bar{\mathcal{I}}_k$ onto the left camera. As
a consequence, $I_k$ and $\bar{I}_k$ are the intersections of $l_k$
and $\omega_\infty$. Solving for $\omega_\infty$ then consists of the following steps:

Fig. 2. The circular points lie on the absolute conic



1. Use the constraint that the points $I_k$ and $\bar{I}_k$ lie on the lines $l_k$
estimated by the affine autocalibration;

2. Express the constraint that all $I_k$ and $\bar{I}_k$ lie on the same conic
$\omega_\infty$;

3. Estimate $\omega_\infty$ from all $I_k$ and $\bar{I}_k$;

4. Compute $K$ from $\omega_\infty$.

Let $p_1$ and $q_1$ be two real points lying on $l_1$. $I_1$ can be parameterized
by a complex $\lambda$ such that $I_1 = q_1 + \lambda p_1$. As all $I_k$ and
$\bar{I}_k$ belong to the planar object, they are related by the inter-image
homographies $H_{ij}$ and therefore we have for all $k$:

    $I_k \simeq H_{1k} I_1 = H_{1k} q_1 + \lambda H_{1k} p_1$
    $\bar{I}_k \simeq H_{1k} \bar{I}_1 = H_{1k} q_1 + \bar\lambda H_{1k} p_1$    (9)

A constraint can be expressed on $\lambda$ that all points $I_k$ and $\bar{I}_k$
lie on the same conic $\omega_\infty$. We will consider first the case of
unrestricted $K$.
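The geometric facts used in these steps can be verified numerically. In the
sketch below (assumed intrinsics and plane pose, illustrative only), the image
of the circular point (1, i, 0) of a plane is shown to lie both on the vanishing
line of that plane and on the image of the absolute conic. Its real and
imaginary parts give one valid choice of the real points q and p of the
parameterization, with lambda = i.

```python
import numpy as np

K = np.array([[1200.0, 0.0, 256.0], [0.0, 1200.0, 256.0], [0.0, 0.0, 1.0]])  # illustrative
c, s = np.cos(0.4), np.sin(0.4)
R = np.array([[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]])    # illustrative plane pose
t = np.array([0.2, -0.1, 5.0])

G = K @ np.column_stack([R[:, 0], R[:, 1], t])   # plane coordinates (x, y, 1) -> pixels
K_inv = np.linalg.inv(K)
omega = K_inv.T @ K_inv                          # image of the absolute conic, (K K^T)^-1
l = np.linalg.inv(G).T @ np.array([0.0, 0.0, 1.0])    # vanishing line of the plane

I = G @ np.array([1.0, 1.0j, 0.0])    # image of the circular point (1, i, 0)
I_bar = np.conj(I)                    # image of the conjugate circular point

on_conic = abs(I @ omega @ I)         # plain quadratic form, no complex conjugation
on_line = abs(l @ I)

q, p = I.real, I.imag                 # two real points on l, so that I = q + 1j * p
```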



5.1 General Calibration K 

Consider any 3 positions of the planar object associated with the planes $\Pi_i$,
$\Pi_j$ and $\Pi_k$ and the projections $I_i$, $\bar{I}_i$, $I_j$, $\bar{I}_j$,
$I_k$ and $\bar{I}_k$ of their circular points onto the left camera.

Let $Y_{ij}$, $Y_{ik}$ and $Y_{jk}$ be the respective intersections of the lines
$(I_i \bar{I}_j)$ and $(\bar{I}_i I_j)$, $(I_i \bar{I}_k)$ and $(\bar{I}_i I_k)$,
$(I_j \bar{I}_k)$ and $(\bar{I}_j I_k)$. One can show that the expression
of $Y_{ij}$ is:

    $Y_{ij} \simeq (I_i \wedge \bar{I}_j) \wedge (I_j \wedge \bar{I}_i) \simeq A_{ij} u + B_{ij} v + C_{ij}$    (10)

where $A_{ij}$, $B_{ij}$ and $C_{ij}$ are three real vectors depending only on the
entries of $p_1$, $q_1$, $H_{1i}$ and $H_{1j}$, and $u$ and $v$ are two real
numbers such that $u = \lambda\bar\lambda$ and $v = \lambda + \bar\lambda$.




Fig. 3. Pascal’s theorem : condition for 6 points to lie on a conic. 



From Pascal's theorem, the six points $I_i$, $\bar{I}_i$, $I_j$, $\bar{I}_j$, $I_k$
and $\bar{I}_k$ lie on the same conic if and only if $Y_{ij}$, $Y_{ik}$ and
$Y_{jk}$ lie on a line (see Figure 3). This can be expressed as:

    $\det(Y_{ij}, Y_{ik}, Y_{jk}) = 0$    (11)

Using the expression obtained in (10) for $Y_{ij}$ it is clear that (11) - and
therefore the constraint that the points $I_i$, $\bar{I}_i$, $I_j$, $\bar{I}_j$,
$I_k$ and $\bar{I}_k$ are on a conic - is a cubic equation in $u$ and $v$:

    $\Gamma_{ijk}(u, v) = \sum_{N=0}^{3} \sum_{m=0}^{N} \gamma_{m,N-m}\, u^m v^{N-m} = 0$    (12)

where $\gamma_{m,N-m}$ are some real numbers depending only on the entries of
$p_1$, $q_1$, $H_{1i}$, $H_{1j}$ and $H_{1k}$.

From 4 views, it is therefore possible to obtain 4 cubic constraints
$\Gamma_{ijk}(u, v)$ such as (12). Solving these cubic constraints simultaneously
gives a solution for $(u, v)$ from which $\lambda$ and hence $\omega_\infty$ may
be computed.

5.2 Zero Skew, Known Aspect Ratio

In the case of a 3-parameter projective camera as described by the model (1),
skew is zero and the aspect ratio $a$ is known. These constraints can be imposed
by introducing two complex points $J$ and $\bar{J}$ such that
$J = (1 \;\; ai \;\; 0)^\top$ and $\bar{J} = (1 \;\; -ai \;\; 0)^\top$. Then if skew
is zero and the aspect ratio $a$ is known, $J$ and $\bar{J}$ lie on $\omega_\infty$
(the intersection of $\omega_\infty$ with the line at infinity in the image).

The same approach as in the general case described above can be used. Using
any two positions $i$ and $j$ of the planar object, a constraint derived from
Pascal's theorem can be expressed that the 6 points $I_i$, $\bar{I}_i$, $I_j$,
$\bar{I}_j$, $J$ and $\bar{J}$ lie on the same conic $\omega_\infty$. Including
$J$ and $\bar{J}$ reduces the number of views required to solve for
$\omega_\infty$. In this case the constraint (11) has the form:

    $(\lambda - \bar\lambda)^2\, x^\top Q_{ij}\, x = 0$

where $x$ is a real 3-vector such that
$x = (\lambda\bar\lambda,\; \lambda + \bar\lambda,\; 1)^\top$ and $Q_{ij}$ is a
3 x 3 symmetric matrix that depends only on the entries of $p_1$, $q_1$, $H_{1i}$,
$H_{1j}$ and the aspect ratio $a$. As $\lambda$ is a non-real complex number,
$\lambda \neq \bar\lambda$ and the constraint reduces to:

    $x^\top Q_{ij}\, x = 0$    (13)

Then from two views we obtain a quadratic constraint on $x$. From 3 views
or more, we obtain therefore at least 3 independent conics $Q_{ij}$ corresponding
to the quadratic constraints (13). The intersection of these conics gives, when
the motions of the planar object are general, a unique solution for $x$.

Once $x$ is computed (see details in Annex A), $\lambda$ is known ($\lambda$ and
$\bar\lambda$ are the two roots of $z^2 - (\lambda + \bar\lambda) z +
\lambda\bar\lambda = 0$) and then all the points $I_k$, $\bar{I}_k$ can be
estimated as well. $\omega_\infty$ can then be computed as the conic
going through all the points $I_k$, $\bar{I}_k$ and $J$ and $\bar{J}$.

Finally $K$ is estimated by the Cholesky decomposition of
$\omega_\infty = (KK^\top)^{-1}$.
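The last step can be sketched in a few lines of Python. This is an illustration
under stated assumptions: the function name `intrinsics_from_iac` is made up,
the conic is assumed to be known up to an arbitrary non-zero scale, and it is
assumed to be (positive or negative) definite, which may fail for a conic
estimated from very noisy data.

```python
import numpy as np

def intrinsics_from_iac(omega):
    """Recover the upper-triangular K (with K[2, 2] = 1) from omega ~ (K K^T)^-1."""
    omega = np.asarray(omega, float)
    if omega[2, 2] < 0:                  # omega is defined up to scale: fix the sign
        omega = -omega
    L = np.linalg.cholesky(omega)        # omega = L L^T with L lower triangular
    K = np.linalg.inv(L.T)               # L^T = c K^-1 for some c > 0
    return K / K[2, 2]                   # remove the unknown scale

# Illustrative ground truth, including a small skew.
K_true = np.array([[1200.0, 0.5, 256.0], [0.0, 1100.0, 250.0], [0.0, 0.0, 1.0]])
K_inv = np.linalg.inv(K_true)
omega = -3.7 * (K_inv.T @ K_inv)         # arbitrary scale and sign
K_est = intrinsics_from_iac(omega)
```

`np.linalg.cholesky` raises `LinAlgError` when the matrix is not positive
definite, so a caller working with noisy estimates would need to handle that
case.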



5.3 Summary of the Autocalibration Algorithm 

The complete algorithm can be summarized as follows:

1. Compute the fundamental matrix $F$ and the projective coordinates
$\pi_1,\ldots,\pi_n$ of the planes $\Pi_1,\ldots,\Pi_n$;

2. Estimate the inter-image homographies $H_{ij}$;

3. Affine autocalibration: solve the quadratic constraints (8) for $l_1$;

4. Euclidean autocalibration: solve the quadratic constraints (13). Compute
$\lambda$, $I_k$ and $\bar{I}_k$ with (9). Then compute $\omega_\infty$ as the
conic going through all $I_k$ and $\bar{I}_k$ and finally compute $K$ by Cholesky
decomposition;

5. Bundle adjustment (optional): minimization of point backprojection errors
onto the left and right cameras of the 3-D planar scene at its different
locations.

6 Experiments

The stereo autocalibration algorithm has been implemented in MATLAB and
applied to both synthetic and real data.



6.1 Synthetic Data 





Fig. 4. Errors in the estimation of focal length (in %) and of principal point (in pix.) 
vs. level of noise. 




Fig. 5. Errors in the estimation of focal length (in %) and of principal point (in pix.) 
vs. number of image pairs. 



Experiments with simulated data are carried out in order to assess the sta- 
bility of the method against measurement noise. 

A synthetic 3-D planar scene consisting of 100 points is generated and placed
at different locations in 3-D space. The 3-D points of each position are projected
onto the cameras of a stereo rig and Gaussian noise with varying standard
deviation $\sigma$ (from 0.0 to 1.0 pixel) is added to the image point locations.
The cameras have a nominal focal length $f$ of 1200 pixels, unit aspect ratio and
zero image skew, and the image size is 512 x 512. Image point locations are
normalized and inter-image homographies are estimated. The
autocalibration is then computed 100 times for each $\sigma$.

Figure 4 shows the resulting accuracy with varying noise and 7 image pairs.
Figure 5 shows the resulting accuracy with a fixed noise level of 0.7 pix. and a
varying number of image pairs.

The experiments show that the estimation provided by the method is quite
accurate. Even for a level of noise of 1.0 pix., the error in the estimation of the
focal length is less than 2.5%. Moreover the approach gives sufficiently stable
and accurate results to initialize a bundle adjustment procedure. With such
a procedure, the accuracy of the estimation of both the focal length and the
location of the principal point is increased, as shown in Figures 4 and 5.



6.2 Real Data 




Fig. 6. One of the seven pairs gathered by the stereo rig 



We gathered 7 image pairs of a planar scene (see Figure 6) with a stereo rig.
Thirty points are matched between all images and the autocalibration algorithm
is applied using 4 to 7 image pairs from the whole sequence. In order to show
the efficiency of the method, we show results before and after applying the
bundle-adjustment procedure. The results are shown in Figure 7, where they are
compared with the results of an off-line calibration.

As the number of views increases the estimated values approach ground truth.
Although we used few points and all matches were made by hand, the method
gives acceptable results. The bundle-adjustment procedure, initialized with these
results, provides accurate enough calibration for metric reconstruction purposes.

Fig. 7. Results of autocalibration algorithm for the real data of Figure 6 using different
numbers of image pairs and off-line calibration.



7 Conclusion 

We describe in this paper a new method for autocalibrating a stereo rig from 
several views of a plane. 

We show that the epipolar geometry of the rig can easily be estimated with 
a planar scene in motion. We use the constraint that the projections of the cir- 
cular points of a 3-D plane must lie on the image of the absolute conic. Then the 
autocalibration is performed by applying a stratified approach. Both autocali- 
bration steps -affine and Euclidean- involve a set of quadratic constraints and 
we therefore designed a conic intersection method to solve for them. 

Furthermore, our approach provides an algebraic solution (i.e. non-iterative)
to Triggs' planar method when vanishing lines are known, and this could be
used for autocalibrating a camera from a monocular sequence of planes.



A Intersection of Conics 

Let $C_1,\ldots,C_k,\ldots,C_n$ be $n$ conics ($n \ge 3$) represented as 3 x 3
matrices. Let us suppose that these conics have a common intersection $x$. For
each $k$ we have:

    $x^\top C_k\, x = 0$

Consider any two conics $C_i$ and $C_j$. Let $\nu_0$ be a real number such that
$\det(C_i + \nu_0 C_j) = 0$ ($\nu_0$ always exists because
$\nu \mapsto \det(C_i + \nu C_j)$ is a degree-three polynomial with real
coefficients). Let $D_{ij}$ be such that $D_{ij} = C_i + \nu_0 C_j$. Then
$D_{ij}$ belongs to the pencil of conics generated by $C_i$ and $C_j$ and is
degenerate ($\det(D_{ij}) = 0$). Moreover $x$ belongs to $D_{ij}$ because:

    $x^\top D_{ij}\, x = x^\top (C_i + \nu_0 C_j)\, x$
    $\phantom{x^\top D_{ij}\, x} = x^\top C_i\, x + \nu_0\, x^\top C_j\, x = 0$






D. Demirdjian, A. Zisserman, and R. Horaud 




Fig. 8. Intersection of conics 



As a degenerate conic, $D_{ij}$ is the union of two lines $\Delta_{ij}$ and
$\Delta'_{ij}$ and therefore $x$ lies on one (at least) of these two lines. As a
consequence, $x$ can be estimated as the common intersection of all the pairs of
lines $\Delta_{ij}$ and $\Delta'_{ij}$.

Therefore the method we propose for solving simultaneously the quadratic
constraints defined by the matrices $C_k$ consists of the following steps:

- compute the degenerate conics $D_{ij}$ and their associated pairs of lines
$\Delta_{ij}$ and $\Delta'_{ij}$. In practice, it is not necessary to compute all
the possible $D_{ij}$; we can choose to compute only $n$ of them;

- intersect the pairs of lines $\Delta_{ij}$ and $\Delta'_{ij}$, that is, find a
point $x$ such that it belongs to at least one line of each pair. It is worth
noticing that when data are noisy, the lines do not exactly intersect at the
same point and an approach similar to linear least squares can be used to
find the closest point $x$ to all pairs of lines.
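The first step of this procedure can be sketched as follows. The code is an
illustration under assumptions, not the authors' implementation: the two conics
are a hand-picked circle and ellipse, the real root is obtained from an
eigenvalue problem (which assumes the second conic is non-singular), and the
degenerate conic is split into its two lines with a standard adjugate-based
construction that assumes the lines are real. The noisy, least-squares
intersection of many line pairs is not covered.

```python
import numpy as np

def cross_matrix(p):
    return np.array([[0.0, -p[2], p[1]], [p[2], 0.0, -p[0]], [-p[1], p[0], 0.0]])

def degenerate_member(Ci, Cj):
    """Return (nu0, D) with D = Ci + nu0 * Cj, det(D) = 0 and nu0 real."""
    eig = np.linalg.eigvals(np.linalg.solve(Cj, Ci))   # roots of det(Ci + nu Cj) are -eig
    nu0 = -eig[np.argmin(np.abs(eig.imag))].real       # a 3 x 3 real matrix has a real eigenvalue
    return nu0, Ci + nu0 * Cj

def split_lines(D):
    """Split a real, rank-2 symmetric conic into its two lines (assumed real)."""
    # Adjugate of a symmetric 3 x 3 matrix, built from cross products of its rows.
    B = np.array([np.cross(D[1], D[2]), np.cross(D[2], D[0]), np.cross(D[0], D[1])])
    i = np.argmax(np.abs(np.diag(B)))
    p = B[:, i] / np.sqrt(-B[i, i])       # intersection point of the two lines, up to sign
    N = D + cross_matrix(p)               # rank one: outer product of the two lines
    r, c = np.unravel_index(np.argmax(np.abs(N)), N.shape)
    return N[r, :], N[:, c]

# Two illustrative conics: the unit circle and the ellipse x^2/4 + 4 y^2 = 1.
C1 = np.diag([1.0, 1.0, -1.0])
C2 = np.diag([0.25, 4.0, -1.0])
x = np.array([np.sqrt(0.8), np.sqrt(0.2), 1.0])    # one of their common points

nu0, D = degenerate_member(C1, C2)
line_a, line_b = split_lines(D)
distance = min(abs(line_a @ x) / np.linalg.norm(line_a[:2]),
               abs(line_b @ x) / np.linalg.norm(line_b[:2]))
```

The common point lies on one of the two recovered lines, so `distance` vanishes
up to rounding. With several conics, the candidate lines from different pairs
would be intersected as described in the second step.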
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Abstract. In this paper, we show that it is possible to calibrate a ca- 
mera using just a flat, textureless Lambertian surface and constant il- 
lumination. This is done using the effects of off-axis illumination and 
vignetting, which result in reduction of light into the camera at off-axis 
angles. We use these imperfections to our advantage. The intrinsic pa- 
rameters that we consider are the focal length, principal point, aspect 
ratio, and skew. We also consider the effect of the tilt of the camera. 
Preliminary results from simulated and real experiments show that the 
focal length can be recovered relatively robustly under certain conditions. 



1 Introduction 

One of the most common activities prior to using the camera for computer vision
analysis is camera calibration. Many applications require reasonable estimates of
camera parameters, especially those that involve structure and motion recovery.
However, there are applications that may not need accurate parameters, such
as those that only require relative depths, or certain kinds of image-based
rendering. For these, having ballpark figures on camera parameters would be
useful but not critical.

We present a camera calibration technique that requires only a flat, texture- 
less surface (a blank piece of paper, for example) and uniform illumination. The 
interesting fact is that we use the camera optical and physical shortcomings to 
extract camera parameters, at least in theory. 



1.1 Previous Work 

There is a plethora of prior work on camera calibration, which can be roughly
classified into weak, semi-strong and strong calibration techniques. This section
is not intended to present a comprehensive survey of calibration work, but to
provide some background in the area as a means for comparison with our work.

Strong calibration techniques recover all the camera parameters necessary for
correct Euclidean (or scaled Euclidean) structure recovery from images. Many
such techniques require a specific calibration pattern with known exact
dimensions. Photogrammetry methods usually rely on using known calibration
points or structures. Brown, for example, uses plumb lines to recover
distortion parameters. Tsai uses corners of regularly spaced boxes of known
dimensions for full camera calibration. Stein uses point correspondences
between multiple views of a camera that is rotated a full circle to extract
intrinsic camera parameters very accurately. Self-calibration techniques have
also been proposed.

Weak calibration techniques recover a subset of camera parameters that will
enable only projective structure recovery through the fundamental matrix.
Faugeras' work opened the door to this category of techniques. There are
numerous other players in this field.

Semi-strong calibration falls between strong and weak calibration; it allows
structures that are close to Euclidean under certain conditions to be recovered.
Affine calibration falls into this category. In addition, techniques that
assume some subset of camera parameters to be known also fall into this category.
By this definition, Longuet-Higgins' pioneering work falls into this category.
This category also includes Hartley's work on recovering camera focal lengths
corresponding to two views with the assumption that all other camera intrinsics
are known.

The common thread of all these calibration methods is that they require
some form of image feature, or registration between multiple images, in order to
extract camera parameters. We are not aware of any method that attempts
to recover camera parameters from a single image of a flat, textureless surface.
In theory, our method falls into the strong calibration category.

1.2 Outline of Paper 

We first present our derivation to account for off-axis camera effects that include 
off-axis illumination, vignetting, and camera tilt. We then present the results 
of our simulation tests as well as experiments with real images. Subsequently, 
we discuss the characteristics of our proposed method and opportunities for 
improvement before presenting concluding remarks. 



2 Off-axis Camera Effects 

The main simplifying assumptions made are the following: (1) entrance and exit 
pupils are circular, (2) vignetting effect is small compared to off-axis illumina- 
tion effect, (3) surface properties of paper are constant throughout and can be 
approximated as a Lambertian source, (4) illumination is constant throughout 
(absolutely no shadows), and (5) a linear relation between grey level response of 
the CCD pixels and incident power is assumed. We are also ignoring the camera 
radial and tangential distortions. In this section, we describe three factors that 
result in change of pixel intensity distribution: off-axis illumination, vignetting, 
and camera tilt. 
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2.1 Off-axis Illumination 



If the object were a plane of uniform brightness exactly perpendicular to the
optical axis, the illuminance of its image can be observed to fall off with distance
away from the image center (to be more precise, the principal point). It can be
shown that the image illumination varies across the field of view in proportion
with the fourth power of the cosine of the field angle.
We can make use of this fact to derive the variation of intensity as a function of
distance from the on-axis projection. For completeness, we derive the relationship
from first principles.





Fig. 1. Projection of areas: (a) On-axis, (b) Off-axis at entrance angle $\theta$. Note
that the unshaded ellipses on the right sides represent the lens for the imaging plane.

The illuminance on-axis (for the case shown in Figure 1(a)) at the image
point indicated by $dA'$ is

    $I'_0 = \frac{L S}{(M R)^2}$    (1)

$L$ is the radiance of the source at $dA$, i.e., the emitted flux per unit solid
angle, per unit projected area of the source. $S$ is the area of the pupil normal
to the optical axis, $M$ is the magnification, and $R$ is the distance of $dA$ to
the entrance lens. The flux is related to the illuminance by the equation

    $I' = \frac{d\Phi}{dA'}$    (2)

Now, the flux for the on-axis case (Figure 1(a)) is

    $d\Phi_0 = \frac{L\, dA\, S}{R^2}$    (3)

However, the flux for the off-axis case (Figure 1(b)) is

    $d\Phi = \frac{L (dA \cos\theta)(S \cos\theta)}{(R / \cos\theta)^2}
           = \frac{L\, dA\, S}{R^2} \cos^4\theta
           = dA' \frac{L S}{(M R)^2} \cos^4\theta$    (4)

since $dA' = M^2 dA$.

As a result, the illuminance at the off-axis image point will be

    $I'(\theta) = I'_0 \cos^4\theta$    (5)

If $f$ is the effective focal length and the area $dA'$ is at image position
$(u, v)$ relative to the principal point, then

    $I'(\theta) = I'_0 \left( \frac{f}{\sqrt{f^2 + u^2 + v^2}} \right)^4
                = \frac{I'_0}{\left(1 + (r/f)^2\right)^2} = \beta I'_0$    (6)

where $r^2 = u^2 + v^2$.
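The attenuation factor of (6) is easy to evaluate. The following Python sketch,
with illustrative function names and parameter values that are assumptions,
computes the factor at an image position and renders the ideal image of a
uniform sheet facing the camera, with the principal point assumed to be at the
image center.

```python
import math

def off_axis_attenuation(u, v, f):
    """beta in (6): cos^4(theta) written as 1 / (1 + (r/f)^2)^2 with r^2 = u^2 + v^2."""
    r2 = u * u + v * v
    return 1.0 / (1.0 + r2 / (f * f)) ** 2

def attenuation_image(width, height, f, I0=255.0):
    """Ideal image of a uniform fronto-parallel sheet, principal point at the center."""
    cu, cv = (width - 1) / 2.0, (height - 1) / 2.0
    return [[I0 * off_axis_attenuation(x - cu, y - cv, f) for x in range(width)]
            for y in range(height)]

image = attenuation_image(256, 240, f=250.0)   # illustrative size and focal length
```

Shorter focal lengths (wider fields of view) give a stronger fall-off toward the
corners, which is the signal the calibration relies on.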



2.2 Vignetting Effect 

The off-axis behaviour of attenuation is optical in nature, and is the result of
the intrinsic optical construction and design. In contrast, vignetting is caused by
partial obstruction of light from the object space to image space. The obstruction
occurs because the cone of light rays from an off-axis source to the entrance pupil
may be partially cut off by the field stop or by other stops or lens rim in the
system.

Fig. 2. Geometry involved in vignetting. Note that here there exists a physical stop.
Contrast this with Figure 1, where only the effect of off-axis illumination is considered.

The geometry of vignetting can be seen in Figure 2. Here, the object space is
to the left while the image space is to the right. The loss of light due to vignetting
can be expressed as the approximation (see pg. 346 of the cited reference)

    $I'_{vig}(\theta) \approx (1 - \alpha r)\, I'(\theta)$    (7)

This is a reasonable assumption if the off-axis angle is small. In reality, the
expression is significantly more complicated in that it involves several other
unknowns. This is especially so if we take into account the fact that the off-axis
projection of the lens rim is elliptical and the original on-axis projection
has a radius different from that of $G_2$ in Figure 2.

2.3 Tilting the Camera

Since the center of rotation can be chosen arbitrarily, we use a tilt axis in a plane
parallel to the image plane at an angle $\chi$ with respect to the x-axis (Figure 3).
The tilt angle is denoted by $\tau$. The normal to the tilted object sheet can be
easily shown to be

    $n_\tau = (\sin\chi \sin\tau,\; -\cos\chi \sin\tau,\; \cos\tau)^\top$.    (8)

The ray that passes through $(u, v)$ has a unit vector

    $n_\theta = \frac{(u,\; v,\; f)^\top}{\sqrt{f^2 + u^2 + v^2}}$    (9)



Calibrate Using Flat Textureless Lambertian Surface? 







Fig. 3. Tilt parameters $\chi$ and $\tau$. The rotation axis lies on a plane parallel to
the image plane.



The foreshortening effect is thus

    $n_\theta \cdot n_\tau = \cos\theta \cos\tau \left( 1 + \frac{\tan\tau}{f} (u \sin\chi - v \cos\chi) \right)$    (10)



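There are two changes to (4), and hence (5), as a result of the tilt:

- Foreshortening effect on local object area $dA$, where $dA \cos\theta$ is
replaced by $dA\,(n_\theta \cdot n_\tau)$

- Distance to lens, where $(R / \cos\theta)^2$ is replaced by
$\left(R / ((n_\theta \cdot n_\tau) / \cos\tau)\right)^2$

This is computed based on the following reasoning: The equation of the tilted
object plane, originally $R$ distance away from the center of projection, is

    $p \cdot n_\tau = (0, 0, R)^\top \cdot n_\tau = R \cos\tau$    (11)

The image point $(u, v)$, whose unit vector in space is $n_\theta$, is the
projection of the point $R_\tau n_\theta$, where $R_\tau$ is the distance of the
3-D point to the point of projection. Substituting into the plane equation, we get

    $R_\tau = \frac{R \cos\tau}{n_\theta \cdot n_\tau}$    (12)

Incorporating these changes into (5), we get

    $I'_\tau(\theta) = I'_0\, (n_\theta \cdot n_\tau) \left( \frac{n_\theta \cdot n_\tau}{\cos\tau} \right)^2 \cos\theta$    (13)
    $\phantom{I'_\tau(\theta)} = I'_0 \cos\tau \left( 1 + \frac{\tan\tau}{f} (u \sin\chi - v \cos\chi) \right)^3 \cos^4\theta$
    $\phantom{I'_\tau(\theta)} = I'_0\, \gamma \cos^4\theta = I'_0\, \gamma\, \beta$

from (6), with $\gamma = \cos\tau \left( 1 + \frac{\tan\tau}{f} (u \sin\chi - v \cos\chi) \right)^3$.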



2.4 Putting it All Together 

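Combining (7) and (13), we have

    $I'_{all}(\theta) = I'_0\, (1 - \alpha r)\, \gamma\, \beta$    (14)

where $\beta$ is the off-axis factor of (6) and $\gamma$ the tilt factor of (13).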
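As a hedged illustration, the combined model (14) can be evaluated directly in
Python. The sketch below ignores the principal point offset, aspect ratio and
skew (positions are taken relative to the principal point), and it pairs the
model with a plain sum-of-squares cost of the kind a simplex search could
minimize; that cost and all names and values are assumptions made for the
example.

```python
import math

def predicted_intensity(u, v, I0, f, alpha, chi, tau):
    """Evaluate (14): I0 * (1 - alpha r) * gamma * beta at image position (u, v)."""
    r = math.hypot(u, v)
    beta = 1.0 / (1.0 + (r / f) ** 2) ** 2                  # off-axis term, as in (6)
    tilt = 1.0 + math.tan(tau) / f * (u * math.sin(chi) - v * math.cos(chi))
    gamma = math.cos(tau) * tilt ** 3                       # tilt term, as in (13)
    return I0 * (1.0 - alpha * r) * gamma * beta

def squared_error(params, samples):
    """Sum of squared differences between model and observed intensities (assumed cost)."""
    I0, f, alpha, chi, tau = params
    return sum((predicted_intensity(u, v, I0, f, alpha, chi, tau) - I) ** 2
               for u, v, I in samples)

true_params = (255.0, 500.0, 1e-4, 0.3, 0.05)     # illustrative values only
grid = [(u, v) for u in range(-120, 121, 40) for v in range(-100, 101, 50)]
samples = [(u, v, predicted_intensity(u, v, *true_params)) for u, v in grid]

cost_true = squared_error(true_params, samples)
cost_wrong_f = squared_error((255.0, 600.0, 1e-4, 0.3, 0.05), samples)
```

On noise-free synthetic samples the cost is zero at the generating parameters
and positive for a wrong focal length, which is the behaviour an optimizer
would exploit.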



We also have to take into consideration the other camera intrinsic parameters, namely the principal point $(p_x, p_y)$, the aspect ratio $a$, and the skew $s$. $(p_x, p_y)$ is specified relative to the center of the image. If $(u_{orig}, v_{orig})$ is the original image location relative to the camera image center, then we have

(15)

The objective function that we would like to minimize is thus

(16)

Another variant of this objective function we could have used is the least median of squares metric.

3 Experimental Results

The algorithm implemented to recover both the camera parameters and the off-axis attenuation effects is based on the downhill Nelder-Mead simplex method. While it may not be computationally efficient, it is compact and very simple to implement.

3.1 Simulations

The effects of off-axis illumination and vignetting are shown in Figures 4 and 5. As can be seen, the drop-off in pixel intensity can be dramatic for short focal lengths (or wide fields of view) and significant vignetting. Our algorithm depends on the dynamic range of pixel variation for calibration, which means that it will not work with cameras with a very small field of view.

There is no easy way of displaying the sensitivity of all the camera parameters to the intensity noise $\sigma_n$ and the original maximum intensity level $I_0$ (as in (14)). In our simulation experiments, we ran 50 runs for each value of $\sigma_n$ and $I_0$. In each run we randomize the values of the camera parameters, synthetically generate the appearance of the image, and use our algorithm to recover the camera parameters. Figure 6 shows the graphs of errors in the focal length $f$, the location of the principal point $p$, and the aspect ratio $a$. As can be seen, $f$ and $a$ are stable under varying $\sigma_n$ and $I_0$, while the error in $p$ generally increases with increasing intensity noise. The error in $p$ is computed relative to the image size.

Fig. 4. Effects of small focal lengths (large off-axis illumination effects) and vignetting: (a) image with $f = 500$, (b) image with $f = 250$, and (c) image with $f = 500$ and a nonzero vignetting parameter $\alpha$. The size of each image is 240 × 256.

Fig. 5. Profiles of images (horizontally across the image center) with various focal lengths and vignetting effects: (a) varying $f$, (b) varying $\alpha$ (at $f = 500$).
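The drop-off profiles discussed above can be reproduced with a few lines. This is only a sketch under stated assumptions: no tilt (so the tilt factor is 1), $r$ taken as the radial distance from the image center in pixels, an image width of 256 pixels, and hypothetical function names.

```python
import math

def intensity(u, v, f, i0=255.0, alpha=0.0):
    # Off-axis model without tilt: I0 * (1 - alpha * r) * cos^4(theta).
    # Assumption: r is the radial distance (pixels) from the image center.
    r = math.hypot(u, v)
    cos_theta = f / math.sqrt(u * u + v * v + f * f)
    return i0 * (1.0 - alpha * r) * cos_theta ** 4

def horizontal_profile(f, width=256, i0=255.0, alpha=0.0):
    # Intensities along the row through the image center.
    half = width // 2
    return [intensity(u, 0.0, f, i0, alpha) for u in range(-half, half + 1)]

profile_f500 = horizontal_profile(500.0)
profile_f250 = horizontal_profile(250.0)
# Fraction of the central intensity left at the image border.
edge_fraction_f500 = profile_f500[0] / profile_f500[128]
edge_fraction_f250 = profile_f250[0] / profile_f250[128]
```

Under these assumptions the border retains roughly 88% of the central intensity at $f = 500$ but only about 63% at $f = 250$, which illustrates why a wider field of view gives the method more signal to work with.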









Fig. 6. Graphs of errors in selected camera parameters across different maximum intensity $I_0$ and intensity errors: (a) focal length $f$, (b) principal point location, (c) absolute error in aspect ratio. The horizontal axis of each plot is the intensity noise.



3.2 Experiments with Real Images 

We also used our algorithm on real images taken using two cameras, namely the Sony Mavica FD-91 and the Sharp Viewcam VL-E47. We conducted our experiments by first taking a picture of a known calibration pattern and then taking another picture of a blank sheet of paper in place of the pattern at the same camera pose. The calibration pattern is used to extract camera parameters as a means of “ground truth.” Here, calibration is done using Tsai’s algorithm. Note that the experiments were conducted under normal conditions that are not highly controlled.

The results are mixed: the focal length estimated using our proposed technique differs by 6% to 50% from the value recovered using Tsai’s calibration technique. The results tend to be better for images taken at wider angles (and hence with more pronounced off-axis illumination drop-off effects). It is also interesting to find that the focal length estimated using our method is consistently underestimated compared to that estimated using Tsai’s algorithm. What is almost universal, however, is that the estimation of the principal point and camera tilt using our method is unpredictable and quite often far from the recovered “ground truth.” However, we should note that Tsai’s calibration method for a single plane does not produce a stable value for the principal point when the calibration plane is close to being fronto-parallel with respect to the camera.




Fig. 7. Two real examples: images of calibration pattern (a,c) and their respective “blank” images (b,d). The image size for (a,b) is 512 × 384 while that for (c,d) is 640 × 486.

In this paper, we describe two of the experiments with real images. In experiment 1, the images in Figure 7(a,b) were taken with the Sony Mavica camera. In experiment 2, the images in Figure 7(c,d) were taken with the Sharp Viewcam camera. Notice that the intensity variation in (d) is much less than that of (b). Tables 1 and 2 summarize the results for these experiments. Note that we have converted Tsai’s Euler representation to ours for comparison. Our values are quite different from Tsai’s. There seems to be some confusion between the location of the principal point and the tilt parameters.





              Ours             Tsai’s
f (pixels)    1389.0           1488.9
κ             —                3.56 × 10^[illegible]
a             0.951            1.0
p             (-4.5, 18.8)     (37.8, 14.7)
χ             1.8°             2.1°
τ             [illegible]      -40.0°

Table 1. Comparison between results from our method and Tsai’s calibration for Experiment 1. κ is the radial distortion factor, p is the principal point, a is the aspect ratio, and χ and τ are the two angles associated with the camera tilt.





              Ours             Tsai’s
f (pixels)    2702.9           3393.0
κ             —                -4.51 × 10^[illegible]
a             1.061            1.0
p             (-79.9, -56.9)   (-68.3, -34.6)
χ             1.6°             17.1°
τ             -0.2°            -9.1°

Table 2. Comparison between results from our method and Tsai’s calibration for Experiment 2. κ is the radial distortion factor. Note that in this instance, the calibration plane is almost parallel to the imaging plane.
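For the two experiments reported in Tables 1 and 2, the focal length discrepancy can be computed directly from the tabulated values. The snippet below is a small illustrative calculation (the helper name is hypothetical), taking Tsai’s value as the reference.

```python
def relative_error(estimate, reference):
    # Relative deviation of the estimate from the reference value.
    return abs(estimate - reference) / reference

# (focal length from the proposed method, focal length from Tsai's calibration), in pixels
experiments = {
    "Experiment 1": (1389.0, 1488.9),
    "Experiment 2": (2702.9, 3393.0),
}

errors = {name: relative_error(ours, tsai)
          for name, (ours, tsai) in experiments.items()}

for name, err in errors.items():
    print(f"{name}: {100.0 * err:.1f}% below Tsai's focal length")
```

This gives about 6.7% for Experiment 1 and about 20.3% for Experiment 2, both underestimates, in line with the trend described in the text.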



4 Discussion 

In our work, we ignored the effect of radial distortion. This is for the obvious reason that its radial behaviour can misguide the recovery of off-axis drop-off parameters, which have radial behaviour as well. In addition, shadows, and possibly interreflection, will have a deleterious effect on our algorithm. As a result, it is easy to introduce unwanted and unmodeled effects in the image acquisition process.

Fig. 8. Actual and fit profiles for the examples shown in Figure 7: (a) corresponds to Figure 7(a,b) while (b) corresponds to Figure 7(c,d).



The dynamic range of intensities is also important in our algorithm; this is 
basically a signal-to-noise issue. It is important because of the intrinsic errors of 
pixel intensity due to the digitization process. In a related issue, our algorithm 
works more reliably for wide-angled cameras, where the off-axis illumination and 
vignetting effects are more pronounced. This results in a wider dynamic range 
of intensities. One problem that we have faced in our experiments with real 
images is that one of our cameras used (specifically the Sony Viewcam) has the 
auto-iris feature, which has the unfortunate effect of globally dimming the image 
intensities. 

Another unanticipated issue is that if paper is used and the camera is zoomed 
in too significantly, the fiber of the paper becomes visible, which adds to the 
texture in the resulting image. It is also difficult to have uniform illumination 
under normal, non-laboratory conditions. 

On the algorithmic side, it appears that it is relatively easy to converge on a 
local minimum. However, if the data fit is good, the results are usually close to the 
values from Tsai’s calibration method, which validates our model. We should also 
add that the value of the principal point cannot be stably recovered using Tsai’s 
single-plane calibration method when the calibration plane is close to being 
fronto-parallel with respect to the camera. Our use of a simplified vignetting term 
may have contributed significantly to the error in camera parameter recovery. 

We do admit that our calibration technique, in its current form, may not be practical. However, the picture may be radically different if images were taken under much stricter controls. This is one possible future direction that we can undertake, in addition to reformulating the vignetting term.

5 Conclusions

We have described a calibration technique that uses only the image of a flat 
textureless surface under uniform illumination. This technique takes advantage 
of the off-axis illumination drop-off behaviour of the camera. Simulations have 
shown that both the focal length and aspect ratio are robust to intensity noise 
and original maximum intensity. Unfortunately, in practice, under normal con- 
ditions, it is not easy to extract highly accurate camera parameters from real 
images. Under our current implementation, it merely provides a ballpark figure 
of the focal length. We do not expect our technique to be a standard technique 
to recover camera parameters accurately; there are many other techniques for 
that. What we have shown is that in theory, camera calibration using a flat textureless surface under uniform illumination is possible, and that in practice, a reasonable value of focal length can be extracted. It would be interesting to see if significantly better results can be extracted under strictly controlled conditions.

Abstract. This article deals with optical laws that must be considered when using underwater cameras. Both theoretical and experimental points of view are described, and it is shown that relationships between air and water calibration can be found.



1 Introduction 

The use of vision systems in media in which the wave propagation speed differs from that in air is a subject seldom treated in the vision community. However, any attempt to localize or reconstruct an object observed by an underwater camera (for instance) has to go through a calibration phase.

This article presents some optical considerations relating to underwater cameras.

We show the relationship between the usual pin-hole model of the camera and the general optical model of the lens combination for the same camera in air and under water.

The relations found are verified in simulation and by experiments. We prove that the calibration of a camera working under water does not have to be carried out under water.

The intrinsic parameters of a camera immersed in any fluid can be computed from an air calibration, provided that the optical surface between the two fluids has some simple geometrical properties.



2 Optics 

The classical model used in artificial vision to describe image formation is the perspective projection, and thus the pin-hole model of projection.

Some links between this model and classical optical laws have been established, but the object and its image were both in the same homogeneous medium, namely air.

Underwater camera calibration must involve a slightly more general optical model, taking account of the different fluids in which the object and the image are situated.

r1 = radius of curvature of the first surface
r2 = radius of curvature of the last surface
m = magnification = (pi/po)
po = object distance to the left of the principal point (AoHo)
pi = image distance to the right of the principal point (HiAi)
ec = center thickness
eb = edge thickness
f = effective focal length

Fig. 1. Classical thick model, in a homogeneous fluid



2.1 Prerequisites 

— Conjugate planes: if an optical system makes the rays from an object point $A_o$ converge to a point $A_i$, then $A_i$ is said to be the image, or equivalently the conjugate, of $A_o$.

— Transversal magnification ($G_t$): if $x_{A_o}$ and $x_{A_i}$ are the respective distances of points $A_o$ and $A_i$ to the optical axis, the transversal magnification $G_t$ is equal to the ratio of these distances:

$$G_t = \frac{x_{A_i}}{x_{A_o}}$$

— Angular magnification ($G_a$): angular magnification denotes the ratio of the incident and emergent angles $(u_o, u_i)$ of an optical ray going through two conjugate points of the optical axis:

$$G_a = \left(\frac{u_i}{u_o}\right)_{x=0,\,y=0}$$

— Principal planes and points: object and image principal planes are conjugate planes orthogonal to the optical axis such that the transversal magnification is one. This definition implies that the rays between these two planes are parallel to the optical axis.
Principal points are the intersections of the principal planes with the optical axis.

— Focal length ($f_i$): this is the distance $H_iF_i$, where $F_i$ is the image focal point. We recall that the image of an object located at infinity undergoes no blurring at the focal point.

— Nodal points: these are the pair of conjugate points $N_o$ and $N_i$ on the optical axis such that every ray through $N_o$ emerges at $N_i$ without change of direction (i.e. the angular magnification is one).

2.2 Thick Model, for Two Different Homogeneous Fluids

Most vision applications deal with a camera immersed in a homogeneous fluid, namely air. Under such a hypothesis some simplifications arise and it can be shown that nodal points and principal points coincide. Using a pin-hole model amounts to merging the two principal planes in order to retain only the rays through the equivalent optical center.

Paraxial formulas for a lens located between two distinct homogeneous fluids are found in most handbooks of geometrical optics. We recall the expressions that will be used hereafter.

These formulas extend the properties of the lenses to arbitrary refractive indices of the object medium ($n_1$) and of the image medium ($n_2$), and also involve the mechanical specification of the lenses, in which the glass has an index equal to $n$.

For opticians, the refractive index is the ratio of the speed of light in air to the speed of light in the considered medium. When the situation involves two different extremal indices, the focal length $f$ has two distinct values: $f_o$ for the object medium and $f_i$ for the image medium. Moreover, nodal and principal points are now distinct.

The following relations hold between the different optical variables (cf. Figure 2):

1. Lens constant $k$

(1)

2. Focal lengths:

$$f_o = \frac{n_1}{k}, \qquad f_i = \frac{n_2}{k} \tag{2}$$

3. Gauss relation

$$\frac{n_1}{p_1} + \frac{n_2}{p_2} = k \tag{3}$$

Fig. 2. Optical paths in fluids of different indices

4. Principal points locations

$$EH_o = \frac{n_1\, t_c}{k}\,\frac{(n_2 - n)}{n\, r_2} \tag{4}$$

$$SH_i = \frac{-n_2\, t_c}{k}\,\frac{(n - n_1)}{n\, r_1} \tag{5}$$

5. Nodal points locations

$$EN_o = EH_o + H_oN_o \tag{6}$$

$$SN_i = SH_i + H_iN_i \tag{7}$$

with

$$H_oN_o = H_iN_i = \frac{(n_2 - n_1)}{k} \tag{8}$$



2.3 Entry Surface Properties 

For most applications involving an underwater camera, the lens system is set to be focused at infinity. This allows a focused image to be obtained from an infinite distance down to a few centimeters from the entry surface (the minimum focused distance decreases with short focal lengths). This implies that the photosensitive matrix is almost always located at the image focal point.

Underwater cameras often possess an entry lens whose external surface is a plane. This property can be explained by the necessity of obtaining focused images in water as well as in air.

In fact, an image will remain focused independently of the object medium index if and only if its image focal point remains unchanged under index variation. The image focal point is determined by the distance $SF_i$ (where $S$ denotes the exit surface of the last lens of the optical system).



$$SF_i = SH_i + H_iF_i = SH_i + f_i = \frac{-n_2\, t_c}{k}\,\frac{(n - n_1)}{n\, r_1} + \frac{n_2}{k} \tag{9}$$

When $r_1$ grows to infinity (which is equivalent to having a plane surface at the air/water interface), the expression for $k$ becomes independent of $n_1$:

$$k = \frac{(n_2 - n)}{r_2}$$

and $SF_i$ can be written:

$$SF_i = \frac{n_2}{k} = \frac{n_2\, r_2}{(n_2 - n)} \tag{10}$$

This location of the image focal point relative to the exit surface (the one nearest to the CCD matrix) is also independent of $n_1$: the image is focused in air as well as in water.



2.4 From Thick Model to Pin-Hole Model: The Nodal Points Influence

The pin-hole model merely consists of using only one optical ray through an equivalent point called the optical center. The extension of the optical model to different media indices shows that the role of the optical center will be played by the fusion of the two nodal points, which conserve the angular magnification.

In the vision community the focal length is defined as the distance between the CCD sensor and the optical center. It can be seen in Figure 2 that this distance is equivalent to $N_iF_i$ if the object is at infinity.

$$N_iF_i = N_iH_i + H_iF_i = \frac{n_1 - n_2}{k} + \frac{n_2}{k} = \frac{n_1}{k}$$

$$f_o = N_iF_i = n_1\,\frac{r_2}{(n_2 - n)} = n_1 \cdot C_t \tag{11}$$

This last relation is a major one for our purpose. It can be noticed that the (vision community) ’focal length’ is directly proportional to the object medium index $n_1$, because $n$ (glass index), $n_2$ (CCD medium index, $n_2 = 1$) and $r_2$ are constant.

When the camera is underwater, the focal length is equivalent to the value measured in air multiplied by a factor 1.333.
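The two properties derived above (an image focal point that does not move with the object medium, and a pin-hole focal length proportional to $n_1$) can be illustrated numerically. The sketch below is an illustration only: the lens constant is assumed to be the standard thick-lens power $k = (n - n_1)/r_1 + (n_2 - n)/r_2 - t_c (n - n_1)(n_2 - n)/(n\, r_1 r_2)$, which reduces to $(n_2 - n)/r_2$ for a plane entry surface, and the lens dimensions are hypothetical.

```python
import math

def lens_constant(n1, n, n2, r1, r2, tc):
    # Assumption: standard thick-lens power between media n1 (object) and n2 (image).
    return ((n - n1) / r1 + (n2 - n) / r2
            - tc * (n - n1) * (n2 - n) / (n * r1 * r2))

def image_focal_distance(n1, n, n2, r1, r2, tc):
    # SF_i = SH_i + f_i, with SH_i from (5) and f_i = n2 / k.
    k = lens_constant(n1, n, n2, r1, r2, tc)
    sh_i = -(n2 * tc / k) * (n - n1) / (n * r1)
    return sh_i + n2 / k

def pinhole_focal_length(n1, n, n2, r1, r2, tc):
    # N_i F_i = n1 / k, the 'focal length' of the pin-hole model, as in (11).
    return n1 / lens_constant(n1, n, n2, r1, r2, tc)

# Hypothetical lens: plane entry surface (r1 infinite), glass index 1.5,
# air on the sensor side, r2 = -10 mm, center thickness 2 mm.
lens = dict(n=1.5, n2=1.0, r1=math.inf, r2=-10.0, tc=2.0)

sf_air = image_focal_distance(n1=1.0, **lens)
sf_water = image_focal_distance(n1=1.333, **lens)
ratio = pinhole_focal_length(n1=1.333, **lens) / pinhole_focal_length(n1=1.0, **lens)
```

With these values the image focal point stays at the same distance from the exit surface in air and in water, while the pin-hole focal length scales by the index of water.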

2.5 Distortions and Changes of the View Cone 

It is obvious that the variation of the focal length implies a decrease of the solid angle of view when the camera is immersed. This variation is directly proportional to the index because the image size (and hence the CCD size) is constant (cf. Figure 3).

Fig. 3. Variation of the field of view between air and water.



What can be said about distortion? How are air and water distortions related?

Up to now we have not been able to determine theoretically the mathematical relation between the two distortions; moreover, it is doubtful whether such a relation exists that does not involve the complete description of the lens system. Even without any distortion, the image must be magnified by a factor of 1.333.

Let $u$ be the distorted image of a point in the air medium and $du$ the distortion correction to obtain the perfect perspective projection. If, in the same way, $u'$ is the distorted image of a point in the water medium and $du'$ the new distortion correction, we must have:

$$1.333\,(u + du) = u' + du' \tag{12}$$



3 Camera Calibration 

3.1 Protocol 

This section briefly describes the camera calibration protocol we use in the experimental set-up, in air as well as in water. The approach is mainly the photogrammetric one (i.e. bundle adjustment), and allows, from the observation of a small number of views (approximately 10), both the sensor calibration and the calibration target reconstruction. With this method, it is not necessary to measure the calibration target carefully, but the success of the calibration depends greatly on the accuracy of pattern detection in the set of calibration images. The interested reader may refer to the references for details.




Under the perspective projection (pin-hole model), the relation between a point of an object and its image is given by the following expression:

$$\begin{pmatrix} x_i \\ y_i \\ z_i \end{pmatrix} = \lambda_i \left[ R \begin{pmatrix} X_i \\ Y_i \\ Z_i \end{pmatrix} + T \right] \tag{13}$$

where:

— $(x_i, y_i, z_i)$ is a point defined in the camera frame (cf. Figure 4) with $z_i = f$, i.e. the focal length of the camera,

— $\lambda_i$ is a scale factor introduced when going from $\mathbb{R}^3$ to $\mathbb{R}^2$,

— $(X_i, Y_i, Z_i)$ are the coordinates of the target point in the world frame W-XYZ,

— $(T_x, T_y, T_z)$ are the coordinates of the translation vector $T$,

— $R$ is the rotation matrix.

Eliminating $\lambda_i$ in (13) (and suppressing the $i$ index for simplicity), we obtain the following expressions, known as the colinearity equations in photogrammetry:

$$x = f\,\frac{r_{11}X + r_{12}Y + r_{13}Z + T_x}{r_{31}X + r_{32}Y + r_{33}Z + T_z}, \qquad y = f\,\frac{r_{21}X + r_{22}Y + r_{23}Z + T_y}{r_{31}X + r_{32}Y + r_{33}Z + T_z} \tag{14}$$
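The colinearity equations translate directly into code. The following is a minimal sketch with a hypothetical function name; it takes the rotation as a 3×3 nested list and returns the camera-frame coordinates $(x, y)$ of the projected point, without distortion.

```python
def project(point_world, rotation, translation, f):
    # Camera-frame coordinates before scaling: R * (X, Y, Z) + T.
    cam = [sum(rotation[i][j] * point_world[j] for j in range(3)) + translation[i]
           for i in range(3)]
    # Scale factor chosen so that the third coordinate equals f.
    lam = f / cam[2]
    return (lam * cam[0], lam * cam[1])

identity = [[1.0, 0.0, 0.0],
            [0.0, 1.0, 0.0],
            [0.0, 0.0, 1.0]]

# A point 10 units in front of an unrotated camera with f = 5.
x, y = project((1.0, 2.0, 10.0), identity, (0.0, 0.0, 0.0), 5.0)
```

Here the scale factor is 5/10, so the point projects to (0.5, 1.0); the same point seen with a focal length multiplied by 1.333 would project 1.333 times farther from the principal point.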

If we express $(x, y)$ in the image frame, we get:

$$x = (u + e_x - u_0)\,dx - do_x, \qquad y = (v + e_y - v_0)\,dy - do_y \tag{15}$$

In this expression $e_x, e_y$ are the measurement errors along coordinates $x$ and $y$ (i.e., corrections to add to the measurements to fulfill the projection equations). $do_x, do_y$ are the distortion components, which can be split into two parts, radial and tangential (i.e., $do_x = do_{xr} + do_{xt}$ and $do_y = do_{yr} + do_{yt}$).

The two following expressions are commonly used in photogrammetry and we will adopt them:

$$\begin{aligned}
do_{xr} &= (u - u_0)\,dx\,(a_1 r^2 + a_2 r^4 + a_3 r^6) \\
do_{yr} &= (v - v_0)\,dy\,(a_1 r^2 + a_2 r^4 + a_3 r^6) \\
do_{xt} &= p_1\left[r^2 + 2(u - u_0)^2 dx^2\right] + 2 p_2 (u - u_0)\,dx\,(v - v_0)\,dy \\
do_{yt} &= p_2\left[r^2 + 2(v - v_0)^2 dy^2\right] + 2 p_1 (u - u_0)\,dx\,(v - v_0)\,dy
\end{aligned} \tag{16}$$

In these expressions:

— $u, v$ are the image coordinates in the image frame,

— $u_0, v_0$ are the coordinates of the principal point in the image frame,

— $a_1, a_2, a_3$ are the radial distortion parameters,

— $p_1, p_2$ are the tangential distortion parameters,

— $dx, dy$ are the sizes of the elementary pixel,

— $r = \sqrt{(u - u_0)^2 dx^2 + (v - v_0)^2 dy^2}$ is the distance of the image point from the principal point.
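The radial and tangential terms defined above can be evaluated as follows. This is a sketch with hypothetical names; it uses three radial coefficients as in the formulas, whereas the experiments later in the paper use a higher-degree polynomial.

```python
def distortion(u, v, u0, v0, dx, dy, a, p):
    # a = (a1, a2, a3): radial coefficients; p = (p1, p2): tangential coefficients.
    xm = (u - u0) * dx          # metric offset from the principal point along x
    ym = (v - v0) * dy          # metric offset from the principal point along y
    r2 = xm * xm + ym * ym      # squared distance r^2 to the principal point
    radial = a[0] * r2 + a[1] * r2 ** 2 + a[2] * r2 ** 3
    do_xr = xm * radial
    do_yr = ym * radial
    do_xt = p[0] * (r2 + 2.0 * xm * xm) + 2.0 * p[1] * xm * ym
    do_yt = p[1] * (r2 + 2.0 * ym * ym) + 2.0 * p[0] * xm * ym
    # Total distortion components do_x and do_y.
    return (do_xr + do_xt, do_yr + do_yt)

# The distortion vanishes at the principal point.
at_center = distortion(100.0, 80.0, 100.0, 80.0, 1.0, 1.0, (0.1, 0.0, 0.0), (0.01, 0.0))
# One unit to the right of the principal point, radial term only.
radial_only = distortion(101.0, 80.0, 100.0, 80.0, 1.0, 1.0, (0.1, 0.0, 0.0), (0.0, 0.0))
# Same point, tangential term p1 only.
tangential_only = distortion(101.0, 80.0, 100.0, 80.0, 1.0, 1.0, (0.0, 0.0, 0.0), (0.01, 0.0))
```

Keeping the metric offsets `xm` and `ym` as intermediate values makes the symmetry between the x and y components easy to verify against the four formulas.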

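Substituting the image-frame coordinates and the distortion components into the colinearity equations, we get the following system:

$$\begin{aligned}
u + e_x &= u_0 + (do_{xr} + do_{xt})/dx + \left(\frac{f}{dx}\right)\frac{r_{11}X + r_{12}Y + r_{13}Z + T_x}{r_{31}X + r_{32}Y + r_{33}Z + T_z} = P(\Phi) \\
v + e_y &= v_0 + (do_{yr} + do_{yt})/dy + \left(\frac{f}{dy}\right)\frac{r_{21}X + r_{22}Y + r_{23}Z + T_y}{r_{31}X + r_{32}Y + r_{33}Z + T_z} = Q(\Phi)
\end{aligned}$$

so we have:

$$e_x = P(\Phi) - u, \qquad e_y = Q(\Phi) - v$$

Perspective projection is always defined up to a scale factor. Conventionally, we put $dx = 1$; then, with $f_x = f/dx$ and $f_y = f/dy$, the parameter vector $\Phi$ to estimate in the sensor/target joint calibration is:

$$\Phi_{9+6m+3n} = \left[u_0, v_0, a_1, a_2, a_3, p_1, p_2, f_x, f_y,\; X^1, Y^1, Z^1, \ldots, X^n, Y^n, Z^n,\; T_x^1, T_y^1, T_z^1, \alpha^1, \beta^1, \gamma^1, \ldots, T_x^m, T_y^m, T_z^m, \alpha^m, \beta^m, \gamma^m\right]^T$$

where $n$ is the number of target points and $m$ the number of images.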
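The size of the parameter vector follows from the count of 9 intrinsic parameters, 6 pose parameters per image and 3 coordinates per target point. A small sketch (hypothetical function name; the number of target points used below is an arbitrary illustrative value, not taken from the paper):

```python
def num_unknowns(num_images, num_points):
    # 9 intrinsics + 6 pose parameters per image + 3 coordinates per target point.
    intrinsics = 9
    return intrinsics + 6 * num_images + 3 * num_points

# Twelve images, as in the air calibration; 25 target points is a made-up value.
unknowns = num_unknowns(num_images=12, num_points=25)
```

Each detected target point in an image contributes two equations (one for $u$, one for $v$), so the number of detections must comfortably exceed half of this count for the system to be well determined.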



3.2 Initial Conditions 

Optimisation of the resulting non-linear system is sensitive to the quality of the initial conditions. Generally the distortion coefficients are set to zero. The target is measured roughly (to within a few millimeters); its planar structure eases the operation.

Initial locations of the camera in front of the target are estimated using Dementhon’s algorithm for planar objects. The principal point position is set at the image center, and the focal length is set to the manufacturer’s estimate. Finally, the pixel size is set to around 9 to 15 µm according to the camera manufacturer.



4 Experimentation 

The experimental part consists of the self-calibration of an underwater camera in the two fluids, air and water. We analyze the results and try to relate them to the previous theoretical developments.

4.1 Hardware

Two distinct cameras have been used for the experiments. For the first one, the hardware system was an underwater camera made of a Sony CCD chip, a short focal length system and a special interface lens ensuring a dry interface between air and water. This lens allows an angular field of more than 90 degrees. The whole video system is coupled with an automatic luminosity control device using two regulation loops, one for the mechanical regulation of the iris, the other being a gain controller for the video input signal.

The second one is also based on a Sony CCD chip. However, the whole optical system in this case has been designed for experiments, and the physical properties of each lens (index, size, position ...) are known. This permits a full optical simulation.

Fig. 5. Example of views for air calibration (768*576 pixels)



4.2 Air Calibration 

For this first experiment we calibrate the system out of the water from twelve images of a plane target. Figure 5 shows some of the shots. The strong radial distortion is to be noted. The dark circle around the image is due to the air medium and disappears when the camera is immersed in water.



Results: air calibration (media index = 1), underwater camera 2

camera 2        : Sony + full optical system
lens            : 4 mm
digitizing card : Silicon Graphics
algorithm       : Self-calibration

Number of images   : 12
Number of measures : 283

Residuals e_x, mean and σ (pixel) : 2.69e-05   4.15e-02
Residuals e_y, mean and σ (pixel) : 1.49e-04   4.45e-02

Parameter    Value        σ
fx (pix)     375.65       3.39e-01
fy (pix)     375.81       3.06e-01
u0 (pix)     390.87       5.59e-02
v0 (pix)     291.75       7.31e-02
a1           6.63e-01     7.52e-03
a2           -1.15e+00    5.02e-02
a3           6.83e+00     1.73e-01
a4           -1.20e+01    2.65e-01
a5           8.95e+00     1.56e-01
p1           -1.02e-03    1.98e-04
p2           1.22e-04     1.91e-04

Table 1. Self-calibration in air, camera number 2



Notes:

— The degree of the distortion polynomial has been increased in order to obtain a satisfactory fit with the image, because of the very strong distortion.

— The table presents the computed values of the intrinsic parameters. Note that the residual at convergence is about 0.04 pixel along each coordinate: this is more than the usual results obtained with our method, which are around 0.015 pixels. This reflects the difficulty of detecting the target points very accurately in the image corners, and also the difficulty for the polynomial to fit such a large distortion.
Nevertheless, the algorithm converges to a stable solution, $f_x = 376$ pixels and $f_y = 376$ pixels, corresponding to an observed angular field of 110 degrees when the distortion is compensated.

— The distortion polynomial can be seen in Figure 6. As we get further from the image center, correction values become really high. As the target point measurements have been taken on a disk of radius no more than 320 pixels centered on the image (cf. Figure 5), values of the distortion polynomial are mere extrapolation beyond this limit (Figure 6) and have no physical significance. The undistorted view, whose size has been increased by 400 pixels (lines and columns), shows that inside the measurement field the distortion is properly corrected.

Fig. 6. Radial distortion and corrected image (1168*976 pixels)



4.3 Water Calibration 

In a similar way we have calibrated the system underwater from 10 images of the target. Each point has been detected and matched along the sequence. As previously emphasised, the dark circle disappears from the images. This shows that the intrinsic parameters have been modified. We can also notice that the angular field has shrunk under water, as has the distortion.

Notes:

For this second experiment, the focal length has increased to 500 pixels, which leads to an expected field of view in water close to 90 degrees.
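The expected field of view in water can be estimated from the one observed in air. The sketch below assumes an ideal pin-hole camera with a fixed sensor extent, so that the tangent of the half-angle scales inversely with the focal length; the function name is hypothetical.

```python
import math

def field_of_view_in_water(fov_air_deg, index=1.333):
    # With a fixed sensor size, tan(half-angle) is inversely proportional to
    # the focal length, which is multiplied by the index of water.
    half_air = math.radians(fov_air_deg) / 2.0
    half_water = math.atan(math.tan(half_air) / index)
    return 2.0 * math.degrees(half_water)

fov_water = field_of_view_in_water(110.0)
```

Starting from the 110 degrees observed in air after distortion compensation, this gives a field of view of roughly 94 degrees in water, consistent with the value close to 90 degrees quoted above.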

We have also drawn the distortion curve. As expected, the point displacement is less than in air. Figure 7 presents views obtained with the underwater camera. Note that the calibration target is observed from quite different points of view, in order to constrain the triangulation angles involved in the self-calibration algorithm.

Fig. 7. Example of shots for underwater calibration (768*576 pixels)

Fig. 8. Radial distortion and corrected image in water (968*776 pixels)



Figure 8 shows one of the images after correction. The residuals remain of the same order of magnitude as in air. The black areas at the top and bottom correspond to parts of the image that are not visible in the original view (Figure 10).



4.4 Relations between Water and Air Calibrations 



Focal length: It appears that the theoretical relation is almost completely fulfilled:

The distance between the image nodal point and the CCD matrix is multiplied by the water index.

Results: water calibration (media index = 1.333), underwater camera 2

camera 2        : Sony + full optical system
lens            : 4 mm
digitizing card : Silicon Graphics
algorithm       : Self-calibration

Number of images   : 13
Number of measures : 325

Residuals e_x, mean and σ (pixel) : 1.36e-05    4.641e-02
Residuals e_y, mean and σ (pixel) : -3.17e-05   5.164e-02

Parameter    Value        σ
fx (pix)     499.12       5.51e-01
fy (pix)     501.97       5.22e-01
u0 (pix)     391.81       9.27e-02
v0 (pix)     292.30       1.25e-01
a1           7.52e-01     1.06e-02
a2           -1.84e+00    7.22e-02
a3           9.88e+00     2.54e-01
a4           -1.73e+01    4.11e-01
a5           1.30e+01     2.57e-01
p1           -1.66e-03    2.32e-04
p2           1.62e-03     1.76e-04

Table 2. Self-calibration in water, camera number 2





             f-air     f-water    ratio (f-water / f-air)
fx (pix)     375.65    499.12     1.329
fy (pix)     375.81    501.97     1.336

Table 3. Comparison of focal length in air and water media



Table 3 shows the ratio between the focal length in water and in air. If the focal length uncertainties given by the calibration setup are taken into account, this ratio is almost 1.333.
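The comparison in Table 3 can be reproduced from the calibrated values in Tables 1 and 2. The snippet below is a small illustrative calculation; the variable names are hypothetical.

```python
WATER_INDEX = 1.333

# Calibrated focal lengths in pixels, from Tables 1 and 2.
focal_air = {"fx": 375.65, "fy": 375.81}
focal_water = {"fx": 499.12, "fy": 501.97}

# Measured ratio between the water and air focal lengths.
ratios = {key: focal_water[key] / focal_air[key] for key in focal_air}

# Focal lengths predicted for water from the air calibration alone.
predicted_water = {key: WATER_INDEX * focal_air[key] for key in focal_air}

# Relative gap between the measured ratio and the index of water.
deviation = {key: abs(ratios[key] - WATER_INDEX) / WATER_INDEX for key in ratios}
```

The measured ratios are about 1.329 and 1.336, within half a percent of the index of water, and the focal lengths predicted from the air calibration (about 500.7 and 501.0 pixels) differ from the measured water values by less than 2 pixels.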

Self-calibration experiments have been carried out on many image sequences. 
The reproducibility of the results is ensured, but residuals at convergence are an 
order of magnitude larger than for calibration of a classical camera system. 

$(u_0, v_0)$ location:

               Air       Water
u0 (pixels)    390.87    391.81
v0 (pixels)    291.76    292.30

Table 4. Comparison of principal point location in air and water media



The position of the image principal point (intersection of the optical axis and the CCD matrix) seems to remain unchanged between air and water calibrations (Table 4).

However, it is well known that $u_0$ and $v_0$ are two parameters which are quite sensitive because they can (to a first-order approximation) be compensated by object translation. The poor quality of the images delivered by the first underwater camera system has led to principal point variations of up to 5 pixels.

Distortion:

Fig. 9. Distortions: (a) measured air and measured water distortion curves, (b) predicted and measured water distortion, with the prediction computed from air data



Figure 9(a) shows a joint representation of the two distortion curves. As expected, we can observe the large distortion in air compared with the water measurements.

As seen in Section 2.5, if we assume that the distortion is purely radial, the formula

$$1.333\,(u + d_a(u)) = u' + d_w(u')$$

would seem logical in order to give the natural relation between air and water distortion.

The right-hand part of Figure 9 shows the predicted water distortion compared to the measured one, according to the above expression. We can observe that the fit is quite good. Nevertheless, differences still remain if we try to overlap the two curves.

As the field of view in air is larger than in water, the predicted water distortion in the image border will correspond to very accurate data measured in the air sequence. This is not the case for the water sequence due to the difficulty in obtaining a full image in the image border.

5 Conclusion and Perspectives 

This article gives the basis of multi-fluid sensor calibration, particularly for air and water media. Different relations are shown between the index variation, the focal length, the field of view and the distortion function.

Theoretical laws and measured data agree closely. For most applications that require an underwater camera, it is possible to carry out the sensor calibration in air and predict the intrinsic sensor parameters when the camera is immersed.

Fig. 10. Distortion: (a) Original view (768x576 pix), (b) Undistorted (968x776 pix) from water data, (c) Undistorted from air data
from water data, (c) Undistorted from air data 



To complete this study, we are working on a full optical simulation of the 
camera from the optical properties of each lens that composes the optical system. 
It will then be possible to carry out a more accurate comparison between simulated and 
real images. 

Abstract. Optimally reconstructing the geometry of image triplets from 
point correspondences requires a proper weighting or selection of the 
constraints used between observed coordinates and unknown parameters. 
By analysing the ML-estimation process the paper solves a set of 
as yet unsolved problems: (1) The minimal set of four linearly independent 
trilinearities (Shashua 1995, Hartley 1995) actually imposes only three 
constraints onto the geometry of the image triplet. The seeming contradiction 
between the number of used constraints, three vs. four, can be 
explained naturally using the normal equations. (2) Direct application of 
such an estimation suggests a pseudoinverse of a 4 x 4 matrix of rank 
3, which contains the covariance matrix of the homologous image points, 
to be the optimal weight matrix. (3) Instead of using this singular weight 
matrix one could select three linearly independent constraints. This is 
discussed for the two classical cases of forward and lateral motion, and 
clarifies the algebraic analysis of dependencies between trilinear constraints 
by Faugeras 1995. 

Results of an image sequence with 800 images and an Euclidean parametri- 
zation of the trifocal tensor demonstrate the feasibility of the approach. 



1 Motivation and Problem 

Image triplets offer considerable advantages over image pairs for geometric image 
analysis. Though the geometry of the image triplet is studied quite well, 
implementing an optimal estimation procedure for recovering the orientation and 
calibration of the three images from point, and possibly line, correspondences 
still has to cope with a number of problems. 

1.1 The Task 

This paper discusses the role of the trilinear constraints between observed coor- 
dinates and unknown parameters [12, 13, 2, 8, 16] within an optimal estimation 
process for the orientation of the image triplet and shows an application within 
image sequence analysis. 

The task can formally be described as follows. We assume to have observed 
$J$ sets $\{P'(x', y'), P''(x'', y''), P'''(x''', y''')\}_j$, $j = 1, \ldots, J$, of corresponding points 
in an image triplet. For each set $y_j = (x', y', x'', y'', x''', y''')_j^T$ of six coordinates 
of three corresponding points we have a set of $G_j$ generally nonlinear constraints, 
here the trilinear constraints $g_j(y_j, \beta) = 0$, which link the observed coordinates 
with the $U$ parameters $\beta$ of the orientation of the image triplet, specifically the 
$U = 27$ elements [16] of a 3 x 3 x 3 tensor, termed trifocal tensor by [8]. There 
may be additional $H$ constraints $h(\beta) = 0$ on the parameters alone, which in 
our case reduce the number of degrees of freedom of the trifocal tensor to 18 
[8, 21, 3]. The task is to find optimal estimates for the parameters, taking the 
uncertainty of the observed coordinates, e.g. captured in a covariance matrix 
$\Sigma_{yy}$, into account. 

In this work, we are primarily interested in the optimal determination of 
the orientation and calibration of the three cameras, not in the elements of 
the trifocal tensor per se. We also assume some approximate values for the 
parameters to be known either by the camera setup, as e. g. in motion analysis 
or by some direct solutions. This is no severe restriction, as such techniques 
are available for a large class of setups. However, the optimal estimation of the 
orientation and calibration parameters, though used in [21], has not been treated 
in depth up to now. 

1.2 Problems 

There is a set of yet unsolved problems which are sketched here but worked out 
later: 

P1: The number $G_j$ of constraints: Shashua [16] showed that there exists 
a set of $G_j = 9$ constraints $g_j$ with unique properties: They are linear in 
the coordinates of the three homologous points and in the elements of the 
trifocal tensor. Up to four of them are linearly independent. However, as 
six coordinates are used to determine the three coordinates of the 3D point, 
only three of them actually constrain the orientation of the image triplet. 
Therefore the number of constraints to be used should be $G_j = 3$. Thus 
there seems to be a contradiction in counting independent constraints. 

P2: Choosing $G_j$ constraints: As the choice of these three or four constraints 
depends on the numbering of the images, we altogether have 12 constraints. 
In addition we could also use the 3 epipolar constraints, being bilinear in the 
coordinates, for constraining the orientation. Though the algebraic relations 
between these constraints are analysed in [2], no generally valid rule is known 
for selecting constraints. Therefore we have the problem of choosing a small 
subset of $G_j = 3$ constraints from a total of 15 for the determination of the 
orientation, presuming problem P1 has been clarified. The problem is 
nontrivial because a subset which is well suited in one geometric situation 
may be unfavourable in another, leading to singularities. 

P3: Weighting the constraints: Another way to look at the problem is to 
ask for the optimal weighting of the constraints, which is more general than 
choosing [15]. Then the question arises where to obtain the weights from, 
how to take the geometry into account, how to deal with singular cases and 
how to integrate the uncertainty of the matching procedure. 

P4: Modelling and optimally estimating the geometry: There are several 
possibilities to model the geometry of the image triplet: (a) using the 
unconstrained $U = 27$ elements of the trifocal tensor as unknown parameters, 
(b) using the $U = 27$ elements of the trifocal tensor as parameters 
with $H = 9$ constraints on the parameters, (c) using a minimal parametrization 
with $U = 18$ parameters or (d) even restricting the geometry to that of 
calibrated cameras, leading to a Euclidean version [13, 17] of the trifocal 
tensor involving $U = 11$ parameters: the 5 parameters of the relative orientation of the first 
two images and the 6 parameters of the orientation of the third image. The 
question then arises how an optimal estimation could be performed in each 
case, and how and under which conditions the estimates differ. Moreover, 
how are the above mentioned problems affected by the choice of the model? 

We want to discuss these problems in detail. 

1.3 Outline of the Paper 

We first (section 2.1) present a generic model for representing parameter 
estimation problems. The resulting normal equation matrix, which represents the 
weights of the resulting parameters, can be used to analyse the quality of the 
result. The trilinear constraints on the observed coordinates can be interpreted 
geometrically (sect. 2.2) and allow a transparent visualization of the constraints 
within an image triplet (sect. 2.3). Based on different models for the image triplet 
(sect. 3.1) we discuss the number and the weighting of the constraints (sect. 3.2) 
and the optimal choice of the constraints for the classical cases of lateral and 
forward motion, leading to general selection rules (sect. 3.3). Sect. 4 presents an 
example on real data to prove the concept using a metric version of the trifocal 
tensor. 

Notation: Normal vectors $x$ and $X$ and matrices $R$ are given in italics, homogeneous 
vectors $\mathbf{x}$ and $\mathbf{X}$ and matrices $\mathsf{P}$ in upright letters. If necessary for clarity, 
stochastic variables are underscored, e.g. $\underline{x}$ being the model variable for the 
observed value $x$. True values are indicated with a tilde, e.g. $\tilde{x}$. 

2 Basics 

2.1 Modelling and Estimation 

In this section we describe a broad class of estimation problems (cf. [20]) whose 
solution is obtained by solving an optimization problem of the same general 
form. In all cases the task is to infer the values of $U$ non-observable quantities 
$\tilde{\beta}_u$ from $N$ given observations $y_n$ fulfilling the constraints given by the geometrical, 
physical or other known relations. We treat these quantities as stochastic 
variables in order to be able to describe their uncertainty. As this takes place 
in our model of the actual setup, we distinguish stochastic variables $\underline{x}$ and their 
realizations (observed instances) $x$. 

Modeling the Observation Process We assume that there are two vectors 
of unknown quantities, the $N$-vector $\tilde{y} = (\tilde{y}_1, \ldots, \tilde{y}_n, \ldots, \tilde{y}_N)^T$ and the $U$-vector 
$\tilde{\beta} = (\tilde{\beta}_1, \ldots, \tilde{\beta}_u, \ldots, \tilde{\beta}_U)^T$, participating in the $G$ relations 

$g(\tilde{y}, \tilde{\beta}) = 0$   (1) 

whose structural form is known. The values $\tilde{y}_n$ represent the true values for 
the observations, which according to the model are intended to be made. The 
parameters $\tilde{\beta}_u$ are assumed not to be directly observable. In our application 
these constraints are the trilinearities between observed image coordinates and 
parameters of the geometry of the image triplet, worked out later. 

In addition, it may be that the unknown parameters $\tilde{\beta}$ have to fulfill certain 
constraints, e.g. $\tilde{\beta}^T \tilde{\beta} = 1$. We represent these $H$ constraints by 

$h(\tilde{\beta}) = 0$   (2) 

In our application these may be 9 constraints on the 27 elements of the trifocal 
tensor (cf. [8]), to completely model the image geometry. 

We now observe randomly perturbed values $y$ of the unknown vector $\tilde{y}$. We 
model the perturbation as additive, the random noise vector $e$ being assumed 
to be normally distributed with mean 0 
and covariance matrix $\Sigma_{ee} = \sigma^2 Q_{ee}$: 

$y = \tilde{y} + e, \quad e \sim N(0, \Sigma_{ee}) = N(0, \sigma^2 Q_{ee})$   (3) 

The covariance matrix is separated in two factors: a positive definite symmetric 
matrix $Q_{ee}$, also called the cofactor matrix (cf. Mikhail & Ackermann 
1976), being an initial covariance matrix giving the structure of $\Sigma_{ee}$, and the 
multiplicative variance factor $\sigma^2$, to be estimated. 

This separation has two reasons: One often only knows the ratios between 
the variances of the different observations, and under certain conditions the 
estimation process is independent of the variance factor. The initial covariance 
matrix $Q_{ee}$ is fixed and assumed to be known. It may result from previous 
experiments involving the same kinds of observations as in the current 
observation. The initial covariance matrix $Q_{ee}$ contains within it the scaling of 
the variables, their units, and the correlation structure of the observed variables. 
The variance factor $\sigma^2$ is an unknown multiplier on the known 
initial covariance matrix. It will be estimated using current data. 

The complete model, represented by (1), (2) and (3), is called the 
Gauss-Helmert model (cf. [9]). 

There are various special cases of this model. The most important one is the 
so-called Gauss-Markoff model, $y = g(\beta)$ (cf. [6], p. 213, [11], p. 218), where 
the observation process is made explicit, like in classical regression problems. 
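
As an illustration of the Gauss-Markoff special case, the following minimal sketch fits a straight line by weighted least squares. It builds the normal equation matrix from the design matrix and the inverse variances of the observations and inverts it to obtain the cofactor matrix of the estimated parameters. The function name and the data are hypothetical and serve only as an example of a classical regression problem.

```python
def fit_line_weighted(xs, ys, variances):
    """Weighted least squares for y = b0 + b1 * x (Gauss-Markoff model).

    Builds N = A^T W A and n = A^T W y with rows (1, x) of A and
    W = Diag(1 / variance), then solves the 2 x 2 system N b = n.
    Returns the estimates and the inverse of N.
    """
    n00 = n01 = n11 = r0 = r1 = 0.0
    for x, y, var in zip(xs, ys, variances):
        w = 1.0 / var
        n00 += w
        n01 += w * x
        n11 += w * x * x
        r0 += w * y
        r1 += w * x * y
    det = n00 * n11 - n01 * n01
    if det == 0.0:
        raise ValueError("singular normal equation matrix")
    q = [[n11 / det, -n01 / det], [-n01 / det, n00 / det]]
    b0 = q[0][0] * r0 + q[0][1] * r1
    b1 = q[1][0] * r0 + q[1][1] * r1
    return (b0, b1), q


# Hypothetical observations lying exactly on y = 1 + 2 x.
(b0, b1), q_bb = fit_line_weighted([0, 1, 2, 3], [1.0, 3.0, 5.0, 7.0], [1, 1, 4, 4])
```

The inverse of the normal equation matrix, multiplied by the variance factor, gives the covariance matrix of the estimated parameters.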

We will apply the complete model here for using the trilinear constraints on 
the coefficients of the trifocal tensor for estimating the relative orientation of the 
image triplet and especially for analysing the ranks of the matrices involved for 
discussing the number of necessary constraints. 

Estimating Parameters The estimation problem we wish to solve now 
is: given $y$, estimate $\hat{y}$, $\hat{\beta}$ and $\hat{\sigma}^2$, the most probable values for $\tilde{y}$, $\tilde{\beta}$ and $\sigma^2$. 

We solve this problem by finding the value $(\hat{y}, \hat{\beta})$ for $(\tilde{y}, \tilde{\beta})$ that minimizes 
the weighted sum of squares of residuals, the weight matrix being the inverse 
covariance matrix, $\Phi(\hat{y}, \hat{\beta}) = \frac{1}{2} (\hat{y} - y)^T Q_{yy}^{-1} (\hat{y} - y)$, subject to the constraints 
$g(\hat{y}, \hat{\beta}) = 0$ and $h(\hat{\beta}) = 0$. This is equivalent to finding the minimum of 

$\Phi(\hat{y}, \hat{\beta}, \lambda, \mu) = \frac{1}{2} (\hat{y} - y)^T Q_{yy}^{-1} (\hat{y} - y) + \lambda^T g(\hat{y}, \hat{\beta}) + \mu^T h(\hat{\beta})$   (4) 

where $\lambda$ and $\mu$ are $G$- and $H$-vectors of Lagrangian multipliers. The solution 
is the ML-estimate in case the observations actually follow a normal distribution. 
Otherwise the estimates are (locally) best linear unbiased estimates, i.e. estimates with 
smallest variance. The general solution of this optimization problem is given in 
the appendix. 

We only need the normal equation matrix $N$ here, which contains the 
covariance matrix of the estimated unknown parameters $\hat{\beta}$ in its inverse. With the 
Jacobians $A$ and $B$ of $g$ with respect to the unknowns and the observations, 
the Jacobian $H$ of $h$ with respect to the unknown parameters, and the assumption 
that these matrices have full rank, we obtain the normal equation matrix 

with some matrices $S$ and $T$, cf. (28) in the appendix. 

We will be able to identify the rows of the matrix $A$ with the Jacobian of 
the trilinear constraints w.r.t. the elements of the trifocal tensor, and the matrix 
$(B^T Q_{yy} B)^{-1}$ with the sought weight matrix for the trilinearities, containing the 
(initial) covariance matrix $Q_{yy}$ of the observed coordinates, and to analyse the rank 
of these matrices. 



2.2 Projection Matrices and their Interpretation 

The geometric setup of three images is given by 

$\mathbf{x}_i = \mathsf{P}_i \mathbf{X}, \quad i = 1, 2, 3$   (6) 

which relate the coordinates $\mathbf{X} = (X, Y, Z, 1)^T$ of the object point to the three 
sets of coordinates $\mathbf{x}_i = (u_i, v_i, w_i)^T$ with the (Euclidean) image coordinates 
$x' = u_1/w_1$, $y' = v_1/w_1$, $x'' = u_2/w_2$ etc. The three projection matrices are 

$\mathsf{P}_1 = (\mathbf{1}, \mathbf{2}, \mathbf{3})^T, \quad \mathsf{P}_2 = (\mathbf{4}, \mathbf{5}, \mathbf{6})^T, \quad \mathsf{P}_3 = (\mathbf{7}, \mathbf{8}, \mathbf{9})^T$   (7) 

where the rows are indicated with bold face numbers. With the standard 
parametrization of the projection matrices 

$\mathsf{P}_i = K_i R_i (I \mid -X_{0i})$   (8) 

eq. (6) relates Euclidean object space to Euclidean image space, capturing the 
(Euclidean) object coordinates $X_0$ of the projection centre, the rotation $R$ and 
the calibration $K$, being an upper triangular matrix with 5 free parameters. 

We now describe the geometry of the image triplet using the vectors $\mathbf{1}$, $\mathbf{2}$ 
etc. in detail. We use the following interpretation of the rows $\mathbf{1}$, $\mathbf{2}$ etc. of the 
projection matrices (cf. [2]): In case $u_1 = 0$ and $(v_1, w_1)$ arbitrary, we have 
$\mathbf{1} \cdot \mathbf{X} = 0$, thus the vector $\mathbf{1}$ represents the homogeneous coordinates of the plane 
passing through the $y'$- and the $z'$-axis in the first camera; in the special case of 
$K = \mathrm{Diag}(c_1, c_2, 1)$, i.e. reduced image coordinates but arbitrary focal lengths $c_i$, 
it is perpendicular to the $x'$-axis. By analogy, $\mathbf{4}$ and $\mathbf{7}$ are planes containing 
the $y^{(i)}$- and the $z^{(i)}$-axes in the second and the third camera, $\mathbf{2}$, $\mathbf{5}$ and $\mathbf{8}$ are 
planes containing the $x^{(i)}$- and the $z^{(i)}$-axes, and $\mathbf{3}$, $\mathbf{6}$ and $\mathbf{9}$ are planes containing 
the $x^{(i)}$- and the $y^{(i)}$-axes in the three cameras. Observe that all these planes pass 
through the corresponding projection centre. 

As $u_1 : v_1 : w_1 = (\mathbf{1} \cdot \mathbf{X}) : (\mathbf{2} \cdot \mathbf{X}) : (\mathbf{3} \cdot \mathbf{X})$, and correspondingly for the other 
cameras, we have the following equivalent homogeneous constraints for the image 
coordinates: 




The vectors $A_i$, $B_i$, $D_i$ have a specific geometric meaning [18]: 

The vectors $A_i$ represent planes through the origin of the $i$-th camera, as 
they are linear combinations of the plane vectors; they pass through the 
$y^{(i)}$-axis of the $i$-th camera, as it is contained in both planes $\mathbf{1}$ and $\mathbf{3}$; they pass 
through the image point $P_i$, due to eq. (9); therefore they intersect the image 
plane in the line $u_i$ = const. The vectors $B_i$ represent planes through the origin 
of the $i$-th camera, pass through the $x^{(i)}$-axis of the $i$-th camera, pass through 
the image point $P_i$ and thus intersect the image plane in the line $v_i$ = const. 
Now, the vectors $D_i$ represent planes through the origin of the $i$-th camera, 
pass through the $z^{(i)}$-axis of the $i$-th camera, pass through the image point $P_i$ 
and thus intersect the image plane radially, fixing the direction, motivating the 
notation. 

Observe that the planes $D_i$ are not defined or are unstable for points identical or 
close to the origin (0, 0). Thus, the planes $A_i$, $B_i$ and $D_i$ fix the $x^{(i)}$-, the $y^{(i)}$- and 
the 'directional' coordinate. Only two of the three constraints for each camera 
are independent. 
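
The construction of the three planes can be sketched numerically. The sign convention used here (A = u·3 − w·1, B = v·3 − w·2, D = u·2 − v·1, with 1, 2, 3 the rows of the projection matrix) is an assumption made for illustration; it is one form consistent with the proportion u : v : w given above, not necessarily the exact form of eq. (9). All function names are hypothetical.

```python
def projection_rows(c, x0):
    """Rows 1, 2, 3 of P = K (I | -X0) for K = Diag(c, c, 1) and R = I."""
    return [[c, 0.0, 0.0, -c * x0[0]],
            [0.0, c, 0.0, -c * x0[1]],
            [0.0, 0.0, 1.0, -x0[2]]]


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def planes(P, uvw):
    """Planes A, B, D through the projection centre and the image point.

    Assumed convention: A = u*3 - w*1, B = v*3 - w*2, D = u*2 - v*1.
    """
    u, v, w = uvw
    r1, r2, r3 = P
    A = [u * c3 - w * c1 for c1, c3 in zip(r1, r3)]
    B = [v * c3 - w * c2 for c2, c3 in zip(r2, r3)]
    D = [u * c2 - v * c1 for c1, c2 in zip(r1, r2)]
    return A, B, D


P = projection_rows(1.5, (1.0, 0.0, 0.0))
X = [0.3, 0.2, 5.0, 1.0]                      # homogeneous object point
u, v, w = (dot(row, X) for row in P)          # homogeneous image point
A, B, D = planes(P, (u, v, w))

# All three planes contain the object point ...
residuals = [dot(A, X), dot(B, X), dot(D, X)]
# ... and they are linearly dependent: w * D = v * A - u * B.
dependency = [w * d - (v * a - u * b) for a, b, d in zip(A, B, D)]
```

The second check makes the last statement explicit: with this convention the three plane vectors of one camera satisfy one linear relation, so only two of them are independent.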

2.3 Constraints between Points of an Image Triplet 

Shashua's Four Constraints on the Trifocal Tensor Elements We now 
can easily write down Shashua's constraints [16]. They can be formulated using 
the above mentioned planes, by establishing quadruples of planes which should 
intersect in a 3D point, which is equivalent to requiring the 4 x 4 matrix of the 4 
plane coordinate vectors to be singular or its determinant to vanish: 

$D_1^S = |A_1 B_1 A_2 A_3| = 0, \quad D_2^S = |A_1 B_1 A_2 B_3| = 0$   (10) 

$D_3^S = |A_1 B_1 B_2 A_3| = 0, \quad D_4^S = |A_1 B_1 B_2 B_3| = 0$   (11) 

where $|\cdot\cdot\cdot\cdot|$ denotes the determinant of the vectors. The point-line-line constraint 
results from the fact that the first two vectors $A_1$ and $B_1$ fix the ray through 
the point in the first image and the two other vectors represent lines through the 
points in the second and the third image each (cf. the geometric interpretation 
above). 

Observe that these constraints are linear in all image coordinates, as each of 
these coordinates appears only once in the determinants and the $w_i$-coordinate 
can be set to 1 for all image points, cf. (9c). 

Shashua moreover showed that the constraints (10), (11) can be written as 
linear functions of the 27 entries of a 3 x 3 x 3 tensor with elements $t$, thus each 
is of the form¹ 

$a_{lj}^T t = 0, \quad l = 1, 2, 3, 4$   (12) 

where the 27-vector $a_{lj} = a_l(y_j)$ only depends on the six coordinates of the 
point triple collected in the 6-vector $y_j$ (here indexed with $j$ to indicate the 
used point triple), and the 27-vector $t$ contains the tensor coefficients. Shashua 
showed the 27 x 4 matrix 

$A_j^T = (a_{1j}, a_{2j}, a_{3j}, a_{4j}) = (a_{k,lj}), \quad k = 1, \ldots, 27; \; j = 1, \ldots, J; \; l = 1, \ldots, 4$   (13) 

to have rank four. Observe that $A_j^T$ is the transposed Jacobian of the constraints 
(10 ff.) with respect to the parameters $t$. 

This suggests 4 constraints are necessary if one wants to exploit the full 
information of the image points for recovering the geometry of the image triplet. 
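
The four determinant constraints (10), (11) can be checked numerically on a synthetic point triple. The sketch below assumes calibrated cameras with K = R = I on a common baseline and the plane convention A = x·3 − 1, B = y·3 − 2 for Euclidean image coordinates (x, y); these choices and all function names are assumptions for illustration only.

```python
def det(m):
    """Determinant by Laplace expansion along the first row (fine for 4 x 4)."""
    if len(m) == 1:
        return m[0][0]
    return sum((-1) ** j * a * det([r[:j] + r[j + 1:] for r in m[1:]])
               for j, a in enumerate(m[0]))


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def camera(c, x0):
    """Rows of P = K (I | -X0) with K = Diag(c, c, 1), R = I."""
    return [[c, 0.0, 0.0, -c * x0[0]],
            [0.0, c, 0.0, -c * x0[1]],
            [0.0, 0.0, 1.0, -x0[2]]]


def image_point(P, X):
    u, v, w = (dot(r, X) for r in P)
    return u / w, v / w


def plane_a(P, x):
    return [x * c3 - c1 for c1, c3 in zip(P[0], P[2])]


def plane_b(P, y):
    return [y * c3 - c2 for c2, c3 in zip(P[1], P[2])]


def shashua_constraints(cams, pts):
    """Values of the four determinants (10), (11) for one point triple."""
    (x1, y1), (x2, y2), (x3, y3) = pts
    A1, B1 = plane_a(cams[0], x1), plane_b(cams[0], y1)
    A2, B2 = plane_a(cams[1], x2), plane_b(cams[1], y2)
    A3, B3 = plane_a(cams[2], x3), plane_b(cams[2], y3)
    return [det([A1, B1, A2, A3]), det([A1, B1, A2, B3]),
            det([A1, B1, B2, A3]), det([A1, B1, B2, B3])]


cams = [camera(1.0, (b, 0.0, 0.0)) for b in (0.0, 1.0, 2.0)]
X = [0.3, 0.2, 5.0, 1.0]
pts = [image_point(P, X) for P in cams]
consistent = shashua_constraints(cams, pts)

# Shift the x-coordinate in the third image: the triple is no longer consistent.
x3, y3 = pts[2]
perturbed = shashua_constraints(cams, [pts[0], pts[1], (x3 + 0.01, y3)])
```

For the consistent triple all four values vanish up to rounding. After the shift, the first determinant becomes nonzero, while the second, which does not contain the shifted coordinate, stays at zero.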



Three Constraints between the Observations and the Triplet's Geometry 
However, we could argue that only three constraints are needed: 

If one solves the basic projection equations (6) for the 6 observed coordinates 
one obtains 6 inhomogeneous equations. One now can take three of them and 
solve for the 3 coordinates of the object point. Substituting these object 
coordinates into the other three inhomogeneous equations yields three constraints 
between the six image coordinates and the parameters of the geometry of the 
image triplet, independent of the parametrization. Thus there can only be 3 
independent constraints between the observed image coordinates and the 
parameters of the geometry of the image triplet. 

¹ The four constraints correspond to those given in [8]: 
$u'_k (u''_i u'''_j T_{k33} - u'''_j T_{ki3} - u''_i T_{k3j} + T_{kij}) = 0$ with the combinations (1, 1), (1, 2), (2, 1), (2, 2) for the indices $i$ 
and $j$, homogeneous coordinates $(u'_1, u'_2, u'_3)$ and $u'_3 = 1$ etc. 

In a general setup one could argue that (1) between the first two points 
$P_1$ and $P_2$ the epipolar constraint should be valid and that (2) the 3D point, 
determined from the first two images, should map into the third image. This 
prediction was the basis for the derivation of the trifocal tensor in [8]. 

The epipolar constraint then reads as² 

$D_1^I = |A_1 B_1 A_2 B_2| = 0$   (14) 

The first two vectors $A_1$ and $B_1$ span the ray in the first image, whereas the last 
two vectors span the ray in the second image, which should intersect. The 
3D point from the first two images could be determined as the intersection of 
the planes $A_1$, $B_1$ and $A_2$, which should lie in the two planes $A_3$ and $B_3$, which 
gives rise to two further constraints, namely: 

$D_2^I = |A_1 B_1 A_2 A_3| = 0, \quad D_3^I = |A_1 B_1 A_2 B_3| = 0$   (15) 

which are identical to the first two, $D_1^S$ and $D_2^S$, of Shashua's constraints³. 

Singular Cases Unfortunately this set of constraints does not work in general. 

First, assume the three images have collinear projection centres, establishing 
the $X$-axis in 3D, and the rotation matrices are $R_i = I$. Then the two planes $A_1$ 
and $A_2$ intersect in a line parallel to the $Y$-axis, which, when intersected with 
$B_1$, yields a well defined 3D point. 

Now, if the three projection centres establish the $Y$-axis, the two planes $A_1$ 
and $A_2$ are identical, as they are epipolar planes. Thus the 3D point cannot be 
determined using these two planes. If the constraints $D_2^I$ and $D_3^I$ were 
replaced by the last two constraints (11) of Shashua, we would be able to 
determine and predict the 3D point in this case, but not in the previous one. 

We therefore need to clarify the number of necessary constraints and discuss 
the selection or, more generally, the weighting of the constraints. 
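
The second singular case can be reproduced with a small numerical sketch. It assumes calibrated cameras with K = R = I whose projection centres lie on the Y-axis and uses the plane convention A = x·3 − 1, B = y·3 − 2; both the convention and the function names are assumptions for illustration.

```python
def det(m):
    """Determinant by Laplace expansion along the first row."""
    if len(m) == 1:
        return m[0][0]
    return sum((-1) ** j * a * det([r[:j] + r[j + 1:] for r in m[1:]])
               for j, a in enumerate(m[0]))


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def camera(x0):
    """Rows of P = (I | -X0), i.e. K = I and R = I."""
    return [[1.0, 0.0, 0.0, -x0[0]],
            [0.0, 1.0, 0.0, -x0[1]],
            [0.0, 0.0, 1.0, -x0[2]]]


def image_point(P, X):
    u, v, w = (dot(r, X) for r in P)
    return u / w, v / w


def plane_a(P, x):
    return [x * c3 - c1 for c1, c3 in zip(P[0], P[2])]


def plane_b(P, y):
    return [y * c3 - c2 for c2, c3 in zip(P[1], P[2])]


# Projection centres on the Y-axis.
cams = [camera((0.0, t, 0.0)) for t in (0.0, 1.0, 2.0)]
X = [0.3, 0.2, 5.0, 1.0]
(x1, y1), (x2, y2), (x3, y3) = [image_point(P, X) for P in cams]

A1, B1 = plane_a(cams[0], x1), plane_b(cams[0], y1)
A2, B2 = plane_a(cams[1], x2), plane_b(cams[1], y2)
A3_wrong = plane_a(cams[2], x3 + 0.01)   # deliberately wrong x-coordinate

plane_difference = max(abs(p - q) for p, q in zip(A1, A2))
d_with_a2 = det([A1, B1, A2, A3_wrong])  # uses A2: blind to the error
d_with_b2 = det([A1, B1, B2, A3_wrong])  # uses B2: reacts to the error
```

Because the planes A1 and A2 coincide here, the determinant built with A2 vanishes whatever the third image point is, whereas the corresponding constraint from (11), built with B2, detects the inconsistency.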

3 Constraints within the Estimation Process 

We now want to establish a statistical interpretation of such dependencies. 
Therefore we follow sect. (2.1), and model the reconstruction of the geometry of 
the image triplet. 

3.1 Models 

We distinguish three parametrizations: 

M1: Tensor coefficients: We use the 27 elements $t$ of the trifocal tensor as 
parameters to describe the geometry. We therefore use (12) as constraints 
for each point triplet. We have to distinguish this model from the following: 

² The superscript I indicates case I in the analysis later. 

³ and to Hartley's constraints with indices (1, 1) and (1, 2), cf. previous footnote. 

M2: Projective parametrization: We use a minimal parametrization of the 
trifocal tensor with 18 parameters (cf. e.g. [21]). This leads to a projective 
reconstruction. We need not specify the parametrization for our analysis. 
Instead, we could also use the $U = 27$ tensor coefficients and $H = 9$ appropriate 
constraints between these parameters (cf. [3]). 

M3: Metric parametrization: We use a metric parametrization of the trifocal 
tensor with only 11 parameters in order to achieve a Euclidean reconstruction. 
The reason is: in our special application of image sequence analysis, 
we are able to calibrate the cameras beforehand. Therefore we only have 
11 parameters to specify the geometry of the image triplet, namely the 5 
parameters of the relative orientation of the first two cameras as above and 
the 6 parameters of the exterior orientation of the third camera (cf. [12, 13, 
17]). In our implementation we actually parametrize the orientation by the 
two translation vectors $X_{02}$ and $X_{03}$ and the two quaternions $q_2$ and $q_3$ 
for the rotations, fixing $X_{01} = 0$, and yielding $U = 14$ parameters with the 
$H = 3$ constraints $X_{02}^T X_{02} = 1$, $q_2^T q_2 = 1$ and $q_3^T q_3 = 1$. This model will 
be used in the example. 

In the last two cases M2 and M3 we may use the same constraints as above, by 
just replacing the 27 elements $t_k$ of the trifocal tensor by 27 functions $t_k(\beta)$ of 
the 18 and 14 unknown parameters, thus the constraints (12) now read as 

$g_{lj}(y_j, \beta) = a_{lj}^T t(\beta) = 0, \quad l = 1, 2, 3, 4$   (16) 

The corresponding constraints of set I (14), (15) read as: 

$g_{lj}^I(y_j, \beta) = D_{lj}^I(y_j, \beta) = a_{lj}^{I\,T} t(\beta) = 0, \quad l = 1, 2, 3$   (17) 

In case of model M3 we in addition have the $H = 3$ constraints between the 
parameters only: 

$h(\beta) = (X_{02}^T X_{02} - 1, \; q_2^T q_2 - 1, \; q_3^T q_3 - 1)^T = 0$   (18) 

3.2 Number and Weighting of Constraints 

We now discuss the left upper submatrix $N$ from (5) in our context. In case of 
$j = 1, \ldots, J$ statistically independent triplets of points, thus $Q_{yy} = \mathrm{Diag}(Q_{y_j y_j})$, 
which is no restriction in practical cases, it can be written as 

$N = \sum_{j=1}^{J} N_j = \sum_{j=1}^{J} A_j^T W_j A_j = \sum_{j=1}^{J} A_j^T (B_j^T Q_{y_j y_j} B_j)^{-1} A_j$   (19) 

using $A^T = (A_j^T)$ and $B = \mathrm{Diag}(B_j)$. 

Each part $N_j$ depends on three matrices, $A_j$, $B_j$ and $Q_{y_j y_j}$. They have a 
very specific semantics. They give the key to the solution of the stated problems: 

Coefficient matrix $A_j$: The matrix $A_j$ is the Jacobian of the constraints 
$g_j(y_j, \beta)$ with respect to the unknown parameters, evaluated at the fitted values 
$\hat{\beta}$ and $\hat{y}_j$ of the parameters and the observations resp. (cf. App.). 

In case the constraints are linear in the unknown parameters, the matrix $A_j$ 
only depends on the fitted coordinates $\hat{y}_j$. Moreover, one may then use them for 
a direct solution of the unknowns $\beta$, being the eigenvector corresponding to the 
smallest eigenvalue of $N = A^T A = \sum_j A_j^T A_j$, minimizing the algebraic distance. 

This shows the close relation between the optimal nonlinear estimation and the 
direct solution: in the direct solution the constraints are not weighted, so it obviously 
is an approximation. The weights $W_j$ depend on $B_j$, which itself depends on 
the unknown parameters, and thus are not available in a one-step solution. Due to 
the linear independence of the 4 constraints per point triplet, at least 7 points 
are necessary (cf. [16]) for the determination of the 27 tensor elements. 

In case of 18 parameters the Jacobian $A_j$ turns out to have rank 3 in general, as 
can be shown using Maple. This is due to the projection of the 27-dimensional 
space of tensor parameters $t$ to the 18-dimensional space of parameters $\beta$. 

This can be geometrically visualized as follows: Without loss of generality, 
assume the translation vector $X_{02} = (X_{02}, 0, 0)^T$ and calibrated cameras with 
$K_i = I$, $R_i = I$. Then the two last constraints $D_3^S$ and $D_4^S$ both constrain the 
two first rays to follow the epipolar geometry if the object point is in general 
position: this is because $A_1$ and $B_1$ and the last plane $A_3$ or $B_3$ in (10) fix the 
object point. The plane $B_2$ determined by the $y_2$-coordinate then has to pass 
through that point, in all three cases yielding the same constraint $y'' - y' = 0$. 

Analytically, the two constraints in general are polynomials, which factor 
into, say, $u_3 v_3$ and $u_4 v_4$, where in general position of the point the first factors 
$u_3$ and $u_4$ are nonzero and the second factors are identical, $v_3 = v_4$; thus both 
constraints, though algebraically different, impose the same restrictions onto the 
image geometry. 

This shows that the two geometric setups, with 27 and 18 parameters resp., 
differ in essence, solving problem P1, and explains why there is no real 
contradiction between the numbers of necessary constraints: Shashua's set of 4 constraints 
is necessary for estimating the geometry coded in the elements of the trifocal 
tensor, whereas only 3 constraints are necessary in case one wants to determine 
the projective geometry of the image triplet with 18 parameters. Observe that in this 
case one could also take the 27 elements of the trifocal tensor as unknowns $\beta$ 
and introduce 9 constraints $h$ on these parameters alone; this would not change 
the reasoning. 

Coefficient matrix $B_j$: The matrix $B_j$ is the Jacobian of the constraints 
with respect to the observations, evaluated at the fitted values $\hat{\beta}$ and $\hat{y}_j$ of the 
parameters and the observations resp. (cf. App.). 

It is implicitly used in the solution to determine the (preliminary) covariance 
matrix $Q_{g_j g_j} = B_j^T Q_{y_j y_j} B_j$ of the contradictions $c_j = g_j(y_j, \hat{\beta}^{(\nu)}) \neq 0$, 
i.e. the deviation of the constraint evaluated at the observations $y_j$ and the 
approximate values $\hat{\beta}^{(\nu)}$ of the parameters, by error propagation. The weight 
matrix $W_j = Q_{g_j g_j}^{-1}$, being the inverse of this covariance matrix, therefore is the 
optimal choice. This solves problem P3, namely the choice of the weight matrix. 

If model M1 with the 27 tensor coefficients as unknowns is chosen, the rank 
of this weight matrix in general is four, indicating that all 4 constraints actually 
are relevant and can be adequately weighted. However, for models M2 and M3 
with 18 or fewer parameters describing the projective or Euclidean geometry the 
weight matrix in general has rank 3. This confirms the fact that only three 
independent constraints are available. The Jacobians $A_j$ and $B_j$ have the same null 
space. Taking some generalized inverse 

$W_j = (B_j^T Q_{y_j y_j} B_j)^{-}$   (20) 

when using more than 3 constraints does not lead to a different solution of the 
estimation problem. 

This type of weighting with $W_j$ has been used by [21]. The Jacobian $B$ 
corresponds to the Jacobian $J$ in their eq. (20), which they state to have rank 3. 
They use a pseudo inverse (computed via an SVD) instead of the normal 
inverse $(B_j^T \Sigma_{yy} B_j)^{-1}$ of a minimal set. This is more time consuming than 
inverting a regular 3 x 3 matrix, especially if the number of used constraints 
is much larger than 3, as this computation has to be performed for every point 
triple in every iteration, which may be essential in real time applications. 

However, the analysis confirms the direct 6-point solution of [21] to be a 
solution for the minimal number of points, as the number of free parameters of the 
trifocal tensor is 18 [16]. 
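
The two minimal point numbers mentioned in this section follow from a simple counting argument, sketched below. The helper name is hypothetical; the count of 26 for the linear tensor estimate assumes that the 27 tensor elements are only defined up to a common scale.

```python
import math


def minimal_points(unknowns, constraints_per_point):
    """Smallest number of point triples providing enough independent constraints."""
    return math.ceil(unknowns / constraints_per_point)


# Model M1: 27 homogeneous tensor elements (26 up to scale), 4 constraints each.
points_tensor = minimal_points(26, 4)
# Model M2: 18 free parameters, 3 independent constraints per point triple.
points_projective = minimal_points(18, 3)
```

This reproduces the 7 points needed for the linear determination of the tensor elements and the 6 points of the minimal projective solution.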

The weighting proposed in [15] is only an approximation, as the rank of the 
weight matrix there is 4 instead of 3. 

Covariance matrix $Q_{y_j y_j}$: As to be expected, the weighting of the constraints 
depends on the uncertainty of the feature points or, generally, of the matching 
procedure. This uncertainty can be captured in the covariance matrix $Q_{y_j y_j}$ of 
the 6 coordinates. Usually the unit matrix $I$ will be sufficient. If the matching 
technique provides a realistic internal estimate of the variances, this could be 
used to improve the result. 

Observe that if $Q_{y_j y_j} = I$, the direct solution would use the eigenvector belonging 
to the smallest eigenvalue of $A^T (B^T B)^{-1} A$. This least squares solution is identical to that given by [19], 
as $(B^T B)^{-1} = \mathrm{Diag}(1/|\nabla g_j|^2)$ with the gradient magnitude of the constraints 
w.r.t. the observations. However, it here naturally follows from the general 
solution in a statistical estimation framework as a special case, and shows how 
to handle observed quantities which are correlated. 
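
For a single constraint, the error propagation above reduces to a scalar weight. The following sketch assumes uncorrelated observations, i.e. a diagonal covariance matrix of the six coordinates; the function name and the example gradient are hypothetical.

```python
def constraint_weight(gradient, variances):
    """Weight 1 / (b^T Q b) of one constraint for Q = Diag(variances).

    `gradient` is the derivative b of the constraint with respect to the
    six observed coordinates (x', y', x'', y'', x''', y''').
    """
    q_gg = sum(b * b * var for b, var in zip(gradient, variances))
    if q_gg == 0.0:
        raise ValueError("constraint does not depend on the observations")
    return 1.0 / q_gg


# Gradient of the constraint y'' - y' = 0 used as an example.
grad = [0.0, -1.0, 0.0, 1.0, 0.0, 0.0]
w_unit = constraint_weight(grad, [1.0] * 6)                        # 1 / |grad|^2
w_noisy = constraint_weight(grad, [1.0, 1.0, 1.0, 3.0, 1.0, 1.0])  # noisier y''
```

With unit variances the weight is the inverse squared gradient magnitude; a larger variance of one of the involved coordinates lowers the weight of the constraint accordingly.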

3.3 Choosing Independent Constraints 

Instead of using a pseudo inverse for automatically getting the correct weight, we 
could also choose a set of three independent constraints. The chosen set obviously 
will depend on the position of the object point with respect to the trifocal plane: 
If it is off the trifocal plane, the three pairwise epipolar constraints would work. Thus 
we only analyse the important case where the projection centres are collinear. 
Then all object points lie on a trifocal plane, requiring at least one trilinear 
constraint on the tensor coefficients. We summarize the analysis from [5] here. 

We assume image sequences with $R_i = I$, $K_i = \mathrm{Diag}(c, c, 1)$, thus 
principal distance $c$, and distinguish forward motion in $Z$-direction with 
$X_{02} = (0, 0, B)^T$, $X_{03} = (0, 0, 2B)^T$ and lateral motion in $X$-direction with 
$X_{02} = (B, 0, 0)^T$, $X_{03} = (2B, 0, 0)^T$, thus base length $B$. We apply two 
different sets of constraints. The first is set I as in eq. (14) and (15). Trying to 
obtain full symmetry by using every coordinate twice and fixing each image ray 
in one of the three constraints ([2, 18]) we obtain constraint set II: 

$D_1^{II} = |A_1 B_1 A_2 B_3| = 0, \quad D_2^{II} = |B_1 A_2 B_2 A_3| = 0, \quad D_3^{II} = |A_1 B_2 A_3 B_3| = 0$ 

We give the determinants of the matrices $Q_{g_j g_j} = B_j^T Q_{y_j y_j} B_j$, being 
proportional to the corresponding covariance matrices, in dependency of the object 
coordinates $(X, Y, Z)$ for lateral ($l$) and forward ($f$) motion and for set I and 
II, $d(Z)$ being a function of $Z$ only: 

$|Q_{gg}^{I,l}| = \mathrm{const.}, \quad |Q_{gg}^{I,f}| = X^2 (X^2 + Y^2)\, d(Z)$   (21) 

$|Q_{gg}^{II,l}| = 0, \quad |Q_{gg}^{II,f}| = X^2 Y^2 (X^2 + Y^2)\, d(Z)$   (22) 

Only if the determinant is not 0 does the weight matrix $W_j$ have the proper rank. 
Therefore, set I obviously is useful for all points in lateral motion, as the covariance 
matrix is regular, with a determinant independent of the position. The symmetric 
set II, however, is not useful at all in lateral motion. This is plausible, as 
only the $y$-coordinates are taken into account, i.e. this set then is a variation 
of the threefold use of the epipolar constraint. Both sets do quite a good job in 
forward motion, but lead to singularities if the points lie on the axes: on the 
x-axis for set I, on either of the two axes for set II. Observe that the origin (0, 0) is the 
focus of expansion (FOE): points in the direction of the motion cannot be used 
at all, which is counterintuitive. They actually only constrain the rotation, not 
the translation, and thus lead to only two constraints, causing the rank deficiency. 
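
The behaviour of the two constraint sets can be explored numerically. The sketch below assumes c = 1, B = 1, unit covariance of the observed coordinates and the plane convention A = x·3 − 1, B = y·3 − 2, so that the derivative of a determinant with respect to one coordinate is obtained by replacing the corresponding plane by row 3 of its camera. These choices and all names are assumptions for illustration; only the vanishing or non-vanishing of the determinants is meaningful, not their scale.

```python
def det(m):
    """Determinant by Laplace expansion along the first row."""
    if len(m) == 1:
        return m[0][0]
    return sum((-1) ** j * a * det([r[:j] + r[j + 1:] for r in m[1:]])
               for j, a in enumerate(m[0]))


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def camera(x0):
    """Rows of P = (I | -X0), i.e. c = 1 and R = I."""
    return [[1.0, 0.0, 0.0, -x0[0]],
            [0.0, 1.0, 0.0, -x0[1]],
            [0.0, 0.0, 1.0, -x0[2]]]


def planes_at(cams, X):
    """Planes A, B and row 3 of every camera at the projection of X."""
    data = []
    for r1, r2, r3 in cams:
        w = dot(r3, X)
        x, y = dot(r1, X) / w, dot(r2, X) / w
        data.append({"A": [x * c3 - c1 for c1, c3 in zip(r1, r3)],
                     "B": [y * c3 - c2 for c2, c3 in zip(r2, r3)],
                     "3": r3})
    return data


def gradient(spec, data):
    """Derivative of one determinant w.r.t. (x', y', x'', y'', x''', y''')."""
    vecs = [data[cam][kind] for kind, cam in spec]
    grad = [0.0] * 6
    for k, (kind, cam) in enumerate(spec):
        replaced = vecs[:k] + [data[cam]["3"]] + vecs[k + 1:]
        grad[2 * cam + (0 if kind == "A" else 1)] += det(replaced)
    return grad


def det_q_gg(specs, cams, X):
    """|B^T B| for the given constraint set (unit covariance assumed)."""
    rows = [gradient(s, planes_at(cams, X)) for s in specs]
    return det([[dot(a, b) for b in rows] for a in rows])


SET_I = [[("A", 0), ("B", 0), ("A", 1), ("B", 1)],
         [("A", 0), ("B", 0), ("A", 1), ("A", 2)],
         [("A", 0), ("B", 0), ("A", 1), ("B", 2)]]
SET_II = [[("A", 0), ("B", 0), ("A", 1), ("B", 2)],
          [("B", 0), ("A", 1), ("B", 1), ("A", 2)],
          [("A", 0), ("B", 1), ("A", 2), ("B", 2)]]

lateral = [camera((b, 0.0, 0.0)) for b in (0.0, 1.0, 2.0)]
forward = [camera((0.0, 0.0, b)) for b in (0.0, 1.0, 2.0)]
X_general = [0.3, 0.2, 5.0, 1.0]
X_zero_x = [0.0, 0.2, 5.0, 1.0]

lateral_set_one = det_q_gg(SET_I, lateral, X_general)
lateral_set_two = det_q_gg(SET_II, lateral, X_general)
forward_general = det_q_gg(SET_I, forward, X_general)
forward_zero_x = det_q_gg(SET_I, forward, X_zero_x)
```

In this setup set I gives a regular matrix in lateral motion, set II a singular one, and in forward motion set I becomes singular for the point with X = 0 while it stays regular for the point in general position.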

General rules for ehoosing three eonstraints are the following, solving problem 
P2 while distinuishing between 1, 2 and 3 trilinear constraints within the set: 

1. One trilinear constraint and two epipolar constraints: The trilinear
constraint in lateral motion needs to be one of $|A_1, A_2, A_3, B_i|$,
i = 1, 2, 3. In forward motion we distinguish between points right or left of
the FOE, for which the previous constraint works, and points above or below
the FOE, for which one chooses one of $|B_1, B_2, B_3, A_i|$, i = 1, 2, 3 (†).

2. Two trilinear constraints and one epipolar constraint: For lateral motion
(X-direction) choose set I. For forward motion we again choose the sets
according to position relative to the FOE, namely the determinants

$D_1^{l,r} = |A_1 B_1 A_2 B_2|, \quad D_2^{l,r} = |A_1 B_1 A_2 A_3|, \quad D_3^{l,r} = |A_1 B_1 A_2 B_3|$
$D_1^{a,b} = |A_1 B_1 A_2 B_2|, \quad D_2^{a,b} = |A_1 B_1 B_2 A_3|, \quad D_3^{a,b} = |A_1 B_1 B_2 B_3|$

to be zero (l, r = left/right, a, b = above/below the FOE).

3. Three trilinear constraints with the same ray fixed in all constraints are
generally independent if no constraint contains 3 planes parallel to a/the
trifocal plane (‡). E.g. the set $D_1 = |A_1, B_1, A_2, A_3| = 0$,
$D_2 = |A_1, B_1, A_2, B_3| = 0$, $D_3 = |A_1, B_1, B_2, A_3| = 0$ is
independent in general; in lateral motion for all points, in forward motion at
all points except those with X = 0 or Y = 0.

(†) [13] proposes $|B_i, A_1, A_2, A_3| = 0$, in lateral motion only useful for points with $X \neq 0$.
(‡) The set of constraints discussed in [2], p. 16, $T_{1,2,3,5} = |A_1, B_1, A_2, A_3|$,
$T_{1,3,4,5} = |A_1, A_2, B_2, A_3|$, $T_{1,3,5,6} = |A_1, A_2, A_3, B_3|$ has rank 1 in lateral motion in x- or

4 Example

The usefulness of the estimation procedure for the metric version of the
trifocal tensor is investigated in [1]. An image sequence with 5300 images was
taken from a car. The measured speed is shown in figure 1 (top). The camera is
looking ahead, establishing the case of forward motion. A subsequence of 800
images has been evaluated w.r.t. the geometric analysis of image pairs and
image triplets. A subsection of 100 images was finally used for a bundle
triangulation.

The Procedure: After initializing the procedure, interest points which promise
good correspondence are selected in image i [4]. Using the correspondences
from the two previous frames we predict points in the current image using two
trilinear constraints suited for that point. Thereby we assume constant motion,
thus constant $X_0$ and $R$. All interest points within an adaptive search area
are checked for consistency using normalized cross-correlation. Possibly their
position is corrected based on the point in image i - 1 using a least squares
matching procedure ([7], chap. 16), at the same time yielding internal
estimates for the uncertainty.

These point triplets are used for estimation. We applied the set of constraints
I (14, 15) here and used a pseudo-inverse to cope with singularities, which are
possible (cf. (21b)). The 14 parameters with the 3 constraints of the third
model are estimated using the Gauss-Helmert model, however in a robustified
version, by a reweighting scheme following [10]. Figure 1 (bottom) shows the
number of used points per successfully determined image triplet, which excludes
the images with velocity 0.

Finally, new points are detected and possibly linked to the previous image.
The next image i + 1 is taken as third in the next image triplet, which uses
the metric parametrization of the two previous images as approximate values.
This chaining is not meant to be optimal, nor even consistent, as it is only
used to yield approximate values for the image sequence, which is then to be
optimally reconstructed in one process using a bundle adjustment.
Fig. 1. Speed (top) and number of matched points over 5300 images (bottom)



Results: Some results of the extensive experiments, documented in [1] can be 
summarized as follows: 

y-direction, as the three planes $A_i$ are parallel to the x-axis and should intersect in
one ray, which is expressed equivalently by all three constraints; the set has rank 0
in forward motion in Z-direction as they all contain three planes passing through the
motion axis.
Fig. 2. Above: the Y-coordinate of the translation vector $T_2$ determined from
the first 800 image pairs, revealing quite a number of erroneous values caused
by mismatches. Below: the Y-coordinate determined from image triplets, clearly
demonstrating the effect of higher reliability (cf. [17]).

Fig. 3. Estimated rotation angles of the second image w.r.t. the first. Observe
the typical vibration in ω.

Fig. 4. Backprojection of the trajectory of the image sequence with 100 images
and 3D point cloud together with the trajectory.



• The quality of the motion parameters is much higher when using the image
triplet than when only using image pairs (cf. fig. 2).

• The estimation of the rotation angles (cf. fig. 3) reflects the expected
behaviour, especially vibrations in nick-angle, i.e. the oscillations of ω
around the horizontal x-axis orthogonal to the speed vector.

• The approximate values obtained from the image triplets were sufficiently
accurate to guarantee convergence of a global ML-estimation with a bundle
adjustment (cf. fig. 4).
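Appendix

We give the solution for the optimization problem (4) derived from the
Gauss-Helmert model. For solving this nonlinear problem in an iterative manner
we need approximate values $\beta^{(0)}$ and $y^{(0)}$ for the unknowns
$\hat\beta = \beta^{(0)} + \Delta\beta$ and $\hat y = y^{(0)} + \Delta y$,
which obtain corrections $\Delta\beta$ and $\Delta y$ in an iterative manner.
With the Jacobians

$$A = \left.\frac{\partial g(\beta, y)}{\partial \beta}\right|_{\beta=\beta^{(0)},\, y=y^{(0)}}, \quad
B = \left.\frac{\partial g(\beta, y)}{\partial y}\right|_{\beta=\beta^{(0)},\, y=y^{(0)}}, \quad
H = \left.\frac{\partial h(\beta)}{\partial \beta}\right|_{\beta=\beta^{(0)}} \qquad (23)$$

and the relation $\Delta y = (y - y^{(0)}) - e$ we obtain the linear constraints
$g(\hat\beta, \hat y) = g(\beta^{(0)}, y^{(0)}) + A\,\Delta\beta + B\,\Delta y$, or
$g(\hat\beta, \hat y) = c_g + A\,\Delta\beta - B e$, and
$h(\hat\beta) = c_h + H\,\Delta\beta$, where
$c_g = g(\beta^{(0)}, y^{(0)}) + B (y - y^{(0)})$ and $c_h = h(\beta^{(0)})$ are
the contradictions between the approximate values for the unknown parameters
and the given observations and among the approximate values for the unknowns.

Setting the partials of (4) to zero yields

$$\frac{\partial \Phi}{\partial \hat y^T} = -Q_{yy}^{-1} e + B^T \lambda = 0, \qquad
\frac{\partial \Phi}{\partial \beta^T} = A^T \lambda + H^T \mu = 0 \qquad (24)$$

$$\frac{\partial \Phi}{\partial \lambda^T} = c_g + A\,\Delta\beta - B e = 0, \qquad
\frac{\partial \Phi}{\partial \mu^T} = c_h + H\,\Delta\beta = 0 \qquad (25)$$

From (24a) follows the relation

$$e = Q_{yy} B^T \lambda \qquad (26)$$

When substituting (26) into (25a), solving for $\lambda$ yields

$$\lambda = (B Q_{yy} B^T)^{-1} (c_g + A\,\Delta\beta) \qquad (27)$$

Substitution in (24b) yields the symmetric normal equation system

$$\begin{pmatrix} A^T (B Q_{yy} B^T)^{-1} A & H^T \\ H & 0 \end{pmatrix}
\begin{pmatrix} \Delta\beta \\ \mu \end{pmatrix} =
\begin{pmatrix} -A^T (B Q_{yy} B^T)^{-1} c_g \\ -c_h \end{pmatrix} \qquad (28)$$

The Lagrangian multipliers can be obtained from (27), which then yields the
estimated residuals in (26). The estimated variance factor is given by

$$\hat\sigma_0^2 = \frac{e^T Q_{yy}^{-1} e}{G + H - U} \qquad (29)$$

The number of constraints in excess of the number U - H which is necessary
for determining the unknown parameters, the redundancy, is the denominator
R = G - (U - H). We finally obtain the estimated covariance matrix

$$\hat\Sigma_{\hat\beta\hat\beta} = \hat\sigma_0^2 \, Q_{\hat\beta\hat\beta} \qquad (30)$$

of the estimated parameters, where $Q_{\hat\beta\hat\beta}$ results from the
inverted reduced normal equation matrix using $N = A^T (B Q_{yy} B^T)^{-1} A$:

$$\begin{pmatrix} Q_{\hat\beta\hat\beta} & \cdot \\ \cdot & \cdot \end{pmatrix} =
\begin{pmatrix} N & H^T \\ H & 0 \end{pmatrix}^{-1} \qquad (31)$$

This expression can be used even if N is singular.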
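The appendix derivation can be turned into a few lines of linear algebra. The following is only a minimal sketch of one linearized step, assuming dense NumPy arrays, a regular matrix $B Q_{yy} B^T$ and a positive redundancy; the function name `gauss_helmert_step` and its argument names are illustrative and not taken from the paper.

```python
import numpy as np

def gauss_helmert_step(A, B, H, Qyy, cg, ch):
    """One linearized step of the constrained Gauss-Helmert model (sketch).

    A (G x U), B (G x n) and H (H x U) are the Jacobians, cg and ch the
    contradictions. Returns (dbeta, e, s0sq, Sigma_bb).
    Assumes B Qyy B^T is regular and the redundancy is positive.
    """
    W = np.linalg.inv(B @ Qyy @ B.T)            # (B Qyy B^T)^-1
    N = A.T @ W @ A                             # reduced normal matrix
    U, n_h = A.shape[1], H.shape[0]
    M = np.block([[N, H.T], [H, np.zeros((n_h, n_h))]])
    rhs = -np.concatenate([A.T @ W @ cg, ch])   # right-hand side of (28)
    sol = np.linalg.solve(M, rhs)
    dbeta = sol[:U]                             # sol[U:] holds the multipliers mu
    lam = W @ (cg + A @ dbeta)                  # Lagrangian multipliers, (27)
    e = Qyy @ B.T @ lam                         # estimated residuals, (26)
    redundancy = A.shape[0] + n_h - U           # R = G - (U - H)
    s0sq = float(e @ np.linalg.solve(Qyy, e)) / redundancy   # (29)
    Qbb = np.linalg.inv(M)[:U, :U]              # upper-left block of (31)
    return dbeta, e, s0sq, s0sq * Qbb           # covariance as in (30)
```

For a straight-line fit with g(β, y) = Aβ - y, one would pass `B = -np.eye(len(y))`, `cg = -y` (approximate values β = 0) and, for a constraint such as β₀ + β₁ = 1, `H = [[1, 1]]` and `ch = [-1]`. In a full adjustment the step is repeated with updated approximate values until the corrections become negligible.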
Abstract. We present a new symmetry-based method for automatically
computing, reorienting and recentering the mid-sagittal plane in
anatomical and functional 3D images of the brain. Our approach is
composed of two steps. First, the computation of local similarity measures
between the two hemispheres of the brain allows homologous anatomical
structures or functional areas to be matched, by way of a block matching
procedure. The output is a set of point-to-point correspondences: the
centers of homologous blocks. Subsequently, we define the mid-sagittal
plane as the one best superposing the points in one side of the brain and
their counterparts in the other side by reflective symmetry. The
estimation of the parameters characterizing the plane is performed by a
least trimmed squares optimization scheme. This robust technique allows
normal or abnormal asymmetrical areas to be treated as outliers, and the
plane to be mainly computed from the underlying gross symmetry of the
brain. We show on a large database of synthetic images that we can obtain
subvoxel accuracy in a CPU time of about 3 minutes, for strongly tilted
heads and for noisy and biased images. We present results on anatomical
(MR, CT) and functional (SPECT and PET) images.



1 Introduction 

1.1 Presentation of the Problem 

A normal human head exhibits a rough bilateral symmetry. What is easily
observable for external structures (ears, eyes, nose...) remains valid for the
brain and its components. The brain is split into two hemispheres, in which
each substructure has a counterpart of approximately the same shape and
location in the opposite side (frontal and occipital lobes, ventricles...).
The hemispheres are connected to each other by the corpus callosum, and
separated by a roughly planar, mid-sagittal fissure.

However, it has been reported since the late 19th century that conspicuous
morphological differences between the hemispheres make the brain
systematically asymmetrical. For example, the wider right frontal and left
occipital lobes give rise to a torque effect of the overall brain shape (see
Fig. 1). More subtly, the natural variability of the cortex translates into
slight differences between hemispheres. In the same way, cerebral dominance
has been demonstrated since the work of Paul Broca on language lateralization
(1861), and many brain functions are now thought or known to be located mainly
in one of the hemispheres (handedness, visual abilities, etc.). The question of
whether the anatomical and the functional brain asymmetries relate to each
other remains open to debate, even if evidence of close connections has been
demonstrated quite recently. These studies suggest that symmetry
considerations are key to the understanding of cerebral functioning.

© Springer-Verlag Berlin Heidelberg 2000

S. Prima, S. Ourselin, and N. Ayache





Fig. 1. Torque effect of the brain. The right frontal lobe (1) is larger than
the left one, and the opposite holds for the occipital lobe (11). Description
of the hemispheres: 1. Frontal pole 2. Superior frontal sulcus 3. Middle
frontal gyrus 4. Superior frontal gyrus 5. Precentral sulcus 6. Longitudinal
cerebral fissure 7. Precentral gyrus 8. Postcentral gyrus 9. Central sulcus
10. Postcentral sulcus 11. Occipital pole. This illustration comes from the
Virtual Hospital [22].



Volumetric medical images convey information about anatomical (MR, CT)
or functional (PET, SPECT) symmetries and asymmetries, but they are hidden
by the usual tilt of the patient’s head in the device during the scanning
process. More precisely, the “ideal” coordinate system attached to the head,
in which the inter-hemispheric fissure is conveniently displayed, differs from
the coordinate system of the image by three angles around the bottom-top (yaw
angle, axial rotation), the back-front (roll angle, coronal rotation) and the
left-right (pitch angle, sagittal rotation) axes, and three translations along
these directions (see Fig. 2). This means that the fissure is generally not
displayed in the center of the image lattice. This hinders further visual
inspection or analysis, because the homologous anatomical structures or
functional areas in both hemispheres are not displayed in the same axial or
coronal slice in the 3D image.

It is of great interest to correctly reorient and recenter brain images,
because normal (torque effect, intrinsic variability) and abnormal (unilateral
pathologies) departures from symmetry appear more clearly and make the
diagnosis easier in many cases: fractures of the skull in CT images, lesions
or bleeding in MR images, asymmetries of perfusion in SPECT images, etc. Some
diseases, such as schizophrenia, are assumed to be strongly linked with
abnormalities of brain asymmetry: in this case, the brain is suspected to be
more symmetrical than normal. After the initial tilt has been corrected, it is
easier to perform further manual or automatic measurements to compare the two
sides of the brain, because the relative locations of homologous structures
become immediate to assess.
Fig. 2. The “ideal” coordinate system attached to the head (in which the
fissure is close to the plane Z = 0) and the coordinate system of the image
are related to each other by way of three angles (yaw, roll and pitch) and a
3D translation (OO').



Several papers have previously considered the problem of correcting the axial 
and coronal rotations, and the translation along the left-right axis; we give a brief 
overview of the state-of-the-art in the next section. We do not tackle the problem 
of correcting the sagittal rotation (e.g., alignment along the AC-PC line) and 
the translations along the bottom-top and the back-front axes. 

1.2 Existing Methods 

Most of the existing algorithms share a common methodology. First, a suitable
mid-sagittal plane is defined in the brain. Then, the image is rotated and
centered, so that the estimated plane matches the center of the image lattice.
There are mainly two classes of methods, differing in their definition of the
searched plane. We briefly describe their advantages and drawbacks in the
following.

Methods based on the inter-hemispheric fissure. The basic hypotheses
underlying these methods are that the inter-hemispheric fissure is roughly planar,
and that it provides a good landmark for further volumetric symmetry analysis.
Generally, the fissure is segmented in MR images, using snakes or a Hough
transform, and the plane best fitting the segmentation is estimated. As this
approach focuses on the inter-hemispheric fissure, the resulting reorientation and 
recentering of the brain is insensitive to strong asymmetries. Conversely, as the 
global symmetry of the whole brain is not considered, the resulting algorithms 
are very sensitive to the often observed curvature of the fissure, which can lead 
to a meaningless plane (see Fig.QJ. At last, these methods are not adaptable to 
other modalities, where the fissure is not clearly visible. 

Methods based on a symmetry criterion. There are relatively simple methods
of finding a plane of reflective symmetry in the case of perfectly symmetrical
geometrical objects, in 2D or 3D. In this case, it can be demonstrated that any
symmetry plane of a body is perpendicular to a principal axis. In the case of
medical images, the problem is different, because normal and abnormal
asymmetries perturb the underlying symmetry of the brain: a perfect symmetry
plane does not exist. To tackle this problem, an intuitive idea is to define
the mid-sagittal plane as the one that maximizes the similarity between the
image and its symmetric, i.e., the plane with respect to which the brain
exhibits maximum symmetry. In practice, this approximate symmetry plane is
expected to be close to the fissure, but is computed using the whole 3D image
and no anatomical landmarks.

Most often, the chosen similarity criterion is the cross correlation, computed
between either the intensities or other features of the two symmetrical
images with respect to a plane with given parameters. For example, the
criterion can be computed between the derived Extended Gaussian Image (EGI)
and its flipped version: theoretically, if the brain is symmetrical, so is its
EGI. Contrary to the first class of methods, the whole 3D volume is taken into
account, which means that the overall gross symmetry of the brain is used.
Consequently, these methods are less sensitive to the variability of the
inter-hemispheric fissure and its curved shape. The trade-off is the need for
the criterion to be robust with respect to departures from the gross
underlying cerebral symmetry, i.e., the normal and pathological asymmetries of
the brain. This robustness is difficult to achieve with global criteria such
as the cross correlation, which is affected in the same way by areas in strong
(i.e., symmetrical) and weak (i.e., asymmetrical) correlation. The latter can
severely bias the estimation of the plane. To overcome this issue, another
similarity criterion has been proposed: the stochastic sign change, previously
shown to be efficient for rigid registration, even for quite dissimilar
images. In the same way, a specific symmetry measure has been introduced that
considers mainly strongly symmetrical parts of the brain.

One common drawback of these methods is the computational cost of the
algorithms, due to the optimization scheme within the set of possible planes.
However, this cost can often be reduced: the discretization of the parameter
space (which limits the accuracy of the results) or prior knowledge about the
position of the optimal plane allows only a limited number of planes to be
investigated. Thus, the reorientation of the principal axes of the brain and
the centering of its center of mass is often a useful preprocessing step. A
multi-resolution scheme can also accelerate the process. One important feature
of these approaches is their ability to tackle modalities other than MR, in
particular functional images.

1.3 Overview of the Paper 

In this article, we present a new symmetry-based method for computing,
reorienting and recentering the mid-sagittal plane in anatomical and
functional images of the brain. This method, generalizing an approach we
previously described, is composed of two steps. First, the computation of
local rather than global similarity measures between the two sides of the
brain allows homologous anatomical structures or functional areas to be
matched, by way of a block matching procedure. The output is a set of
point-to-point correspondences: the centers of homologous blocks.
Subsequently, we define the mid-sagittal plane as the one best superposing the
points in one hemisphere and their counterparts in the other hemisphere by
reflective symmetry. The estimation of the parameters characterizing the plane
is performed by a least trimmed squares optimization scheme. Then, the
estimated plane is aligned with the center of the image lattice. This method
is fully automated, objective and reproducible.

This approach deals with two severe drawbacks of classical symmetry-based
methods. First, the computation of local measures of symmetry and the use of a
robust estimation technique make it possible to discriminate between
symmetrical and asymmetrical parts of the brain, the latter being naturally
treated as outliers. Consequently, the computation of the mid-sagittal plane
mainly relies on the underlying gross symmetry of the brain. Second, the
regression step yields an analytical solution, computationally less expensive
than the maximization of the global similarity measures described in
Section 1.2.

We describe this approach in Section 2. In Section 3, we show that we can
cope with strongly asymmetrical and tilted brains, even in the presence of
noise and bias, with very good accuracy and low computation time. In
Section 4, we present results on anatomical (MR, CT) and functional (PET,
SPECT) images.

2 Description of the Method 

2.1 Presentation of the Main Principles 

We recall the principles of the method we previously presented. Given $I$, an
MR image of the head, the mid-sagittal plane $P$ is defined as the one best
superposing the pairs $\{a_i, S_P(b_i)\}$, where $a_i$ is a brain voxel, $b_i$
its anatomical counterpart in the other hemisphere, and $S_P$ the symmetry
with respect to $P$. In practice, $P$ is obtained by minimization of the least
squares (LS) criterion $\sum_i \|a_i - S_P(b_i)\|^2$; $\|\cdot\|$ is the
Euclidean norm. An analytical solution of this problem is described in the
appendix. The pairs $\{a_i, b_i\}$ are obtained as follows (see also Fig. 3):

— The mid-sagittal plane $K$ of the image grid ($K$ is fixed to the grid)
differs from the searched mid-sagittal plane $P$ of the brain because of the
tilt of the head during the scanning process, but is usually a good first
estimate. The original image $I$ is flipped with respect to $K$, yielding
$S_K(I)$.

— The “demons” algorithm finds the anatomical counterpart $b'_i$ in $S_K(I)$
of each point $a_i$ in $I$, by way of non-rigid registration between the 2
images.

— $b_i = S_K(b'_i)$ is the anatomical counterpart of $a_i$ in the other
hemisphere. For example, in $I$, the point $a_i$, located at the top of the
right ventricle, is matched with the point $b_i$, located at the top of the
left ventricle.

Once $P$ is computed, the transformation $R = S_K \circ S_P$ is a rotation if
$P$ and $K$ are not parallel and a translation if $P$ and $K$ are parallel.
The transformation $R^{1/2}$, when applied to the image $I$, automatically
aligns the plane $P$ with $K$. Several difficulties and limitations arise when
using this method:

— As with many of the classical symmetry-based methods, normal and
pathological asymmetries can severely disrupt the computation of the plane.
Even though it is based on local instead of global measures of symmetry, the
LS minimization is not robust with respect to outliers, and will be strongly
affected by the departures from the underlying symmetry.

— The non-rigid registration algorithm will provide aberrant matchings when
a structure is absent in one hemisphere (a lesion, one track of white matter,
etc.), or when two structures are present but too different from each other;
these failures are difficult if not impossible to detect. These meaningless
correspondences can significantly affect the LS criterion and its
minimization.

— Finally, the “demons” algorithm mainly relies on the gradient of the image,
and proved to be efficient for low-textured images like MR or CT.
Consequently, this approach is not applicable to SPECT or PET images.

Fig. 3. The non-rigid registration strategy. The point $b'_i$ in $S_K(I)$ is
matched with the point $a_i$ in $I$; $b_i = S_K(b'_i)$ is the counterpart of
$a_i$ in the other hemisphere.



2.2 Modification Based on a Block Matching Strategy and a Robust 
Estimation Technique 

We propose a modification of this approach, which allows the mid-sagittal
plane to be computed mainly from correspondences between very symmetrical
areas, and handles both functional and anatomical images. The methodology is
twofold: we still find point-to-point correspondences between the two sides of
the brain, and then derive the plane best superposing the pairs of matched
points, but the matching and the optimization procedures differ significantly
from those of Section 2.1.



Computation of inter-hemispheric correspondences by a block matching
strategy. The pairs of correspondences $\{a_i, b'_i\}$ are obtained by way of
a block matching strategy between the image $I$ and its symmetric $S_K(I)$.
This procedure has been described extensively for the rigid registration of
anatomical sections. The common lattice of the 2 images (of size
$X \times Y \times Z$) defines a set of rectangular parallelepipedic blocks of
voxels $\{B\}$ in $I$ and $\{B'\}$ in $S_K(I)$, given their size
$N_x \times N_y \times N_z$: both images contain
$(X - N_x + 1) \times (Y - N_y + 1) \times (Z - N_z + 1)$ such blocks. We aim
at matching each block in $\{B\}$ with the block in $\{B'\}$ maximizing a given
similarity measure, which yields a “displacement field” between $I$ and
$S_K(I)$. In practice, an exhaustive search of matchings within $\{B'\}$ for
each block of $\{B\}$ is not computationally feasible. In addition, we have
a priori knowledge about the position of the correspondent $B'$ of $B$: if the
head is not too tilted, $B'$ is located in a neighborhood of $B$. Thus we
constrain the search procedure to subsets defined as follows:

— We limit the search for correspondences to one block $B$ every $\Delta_x$
(resp. $\Delta_y$, $\Delta_z$) voxels in the x (resp. y, z) direction,
defining a subset of $\{B\}$; $\Delta = (\Delta_x, \Delta_y, \Delta_z)$
determines the density of the computed “displacement field” between $I$ and
$S_K(I)$.

— For each block $B$ in this subset, we define a sub-image in $S_K(I)$,
centered on $B$, which delimits a neighborhood of research. This sub-image is
composed of the voxels in $S_K(I)$ located within a distance of $\Omega_x$
(resp. $\Omega_y$, $\Omega_z$) voxels in the x (resp. y, z) direction from
$B$. This yields a rectangular parallelepipedic sub-image of size
$(N_x + 2\Omega_x) \times (N_y + 2\Omega_y) \times (N_z + 2\Omega_z)$ in
$S_K(I)$, which contains
$(2\Omega_x + 1) \times (2\Omega_y + 1) \times (2\Omega_z + 1)$ blocks $B'$
(provided this sub-image is entirely located in $S_K(I)$).

— In this sub-image, we examine one block $B'$ every $\Sigma_x$ (resp.
$\Sigma_y$, $\Sigma_z$) voxels in the x (resp. y, z) direction;
$\Sigma = (\Sigma_x, \Sigma_y, \Sigma_z)$ determines the resolution of the
displacement field.

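Note that the subset of $\{B\}$ in $I$ and the subset of $\{B'\}$ in the
sub-image of $S_K(I)$ contain the following numbers of blocks, respectively:

$$\max\{n_x \mid (n_x - 1)\Delta_x + N_x \le X\} \times \max\{n_y \mid (n_y - 1)\Delta_y + N_y \le Y\} \times \max\{n_z \mid (n_z - 1)\Delta_z + N_z \le Z\}$$

and

$$\max\{n_x \mid (n_x - 1)\Sigma_x \le 2\Omega_x\} \times \max\{n_y \mid (n_y - 1)\Sigma_y \le 2\Omega_y\} \times \max\{n_z \mid (n_z - 1)\Sigma_z \le 2\Omega_z\}$$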
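As a quick arithmetic check of these two counts, the sketch below evaluates them per axis for the parameter values quoted in the caption of Fig. 4 (a 128-voxel axis, block size 32, step 8, neighborhood half-width 8, candidate step 16). The helper names are illustrative only.

```python
def count_block_positions(extent, block, step):
    """Largest n with (n - 1) * step + block <= extent (blocks along one axis)."""
    return (extent - block) // step + 1

def count_candidates(half_width, step):
    """Largest n with (n - 1) * step <= 2 * half_width (candidates along one axis)."""
    return (2 * half_width) // step + 1

# Values from the caption of Fig. 4, identical along the three axes.
n_blocks = count_block_positions(128, 32, 8) ** 3
n_candidates = count_candidates(8, 16) ** 3
print(n_blocks, n_candidates)
```

With these values there are 13 block positions and 2 candidate positions per axis, which matches the 13³ = 2197 blocks and 2³ = 8 candidates stated in that caption.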

We denote by $B_{ijk}$ (resp. $B'_{lmn}$) the block in $I$ (resp. $S_K(I)$)
containing the voxel $(i, j, k)$ (resp. $(l, m, n)$) at its top left back
corner. We summarize the features of the algorithm as follows:

— For ($i = 0$; $i \le X - N_x$; $i = i + \Delta_x$)

— For ($j = 0$; $j \le Y - N_y$; $j = j + \Delta_y$)

— For ($k = 0$; $k \le Z - N_z$; $k = k + \Delta_z$)

— We consider the block $B_{ijk}$ in $I$

— For ($l = i - \Omega_x$; $l \le i + \Omega_x$; $l = l + \Sigma_x$)

— For ($m = j - \Omega_y$; $m \le j + \Omega_y$; $m = m + \Sigma_y$)

— For ($n = k - \Omega_z$; $n \le k + \Omega_z$; $n = n + \Sigma_z$)

— If the block $B'_{lmn}$ in $S_K(I)$ is entirely located in the image
lattice, we compute a similarity measure with $B_{ijk}$

— We retain the block $B'_{lmn}$ with maximal similarity measure, which
defines the displacement vector between the center
$(i + N_x/2, j + N_y/2, k + N_z/2)$ of $B_{ijk}$ and the center
$(l + N_x/2, m + N_y/2, n + N_z/2)$ of $B'_{lmn}$.
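The nested loops above can be sketched compactly in NumPy. This is an illustration only, not the authors' implementation: the function names are hypothetical, the similarity measure is assumed to be the absolute correlation coefficient, and block corners are returned instead of block centers (add half the block size to obtain the centers).

```python
import itertools
import numpy as np

def correlation_coefficient(u, v):
    """Pearson correlation coefficient of two equally shaped blocks."""
    u = u - u.mean()
    v = v - v.mean()
    denom = np.sqrt((u * u).sum() * (v * v).sum())
    return 0.0 if denom == 0 else float((u * v).sum() / denom)

def block_matching(img, flipped, N, Omega, Delta, Sigma):
    """Return a list of (corner_in_img, corner_in_flipped, |CC|) triples."""
    shape = img.shape
    matches = []
    # One block every Delta voxels along each axis of img.
    starts = [range(0, shape[d] - N[d] + 1, Delta[d]) for d in range(3)]
    for corner in itertools.product(*starts):
        block = img[tuple(slice(c, c + n) for c, n in zip(corner, N))]
        # One candidate every Sigma voxels within +/- Omega of the block.
        offsets = [range(c - o, c + o + 1, s)
                   for c, o, s in zip(corner, Omega, Sigma)]
        best_score, best_corner = -1.0, None
        for cand in itertools.product(*offsets):
            inside = all(0 <= c and c + n <= size
                         for c, n, size in zip(cand, N, shape))
            if not inside:
                continue  # candidate block leaves the image lattice
            other = flipped[tuple(slice(c, c + n) for c, n in zip(cand, N))]
            score = abs(correlation_coefficient(block, other))
            if score > best_score:
                best_score, best_corner = score, cand
        if best_corner is not None:
            matches.append((corner, best_corner, best_score))
    return matches
```

In the setting of the text, `flipped` would be the image mirrored about the grid plane K (for instance `np.flip(img, axis=0)` if the left-right axis is the first one, which is an assumption about the array layout), and matches with a low score can be discarded with a threshold before the plane is estimated.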

A given choice of parameters $N = (N_x, N_y, N_z)$,
$\Omega = (\Omega_x, \Omega_y, \Omega_z)$,
$\Delta = (\Delta_x, \Delta_y, \Delta_z)$,
$\Sigma = (\Sigma_x, \Sigma_y, \Sigma_z)$, whose interpretation will be given
later, yields pairs of correspondences $(a_i, b'_i)$ between $I$ and $S_K(I)$,
$a_i$ and $b'_i$ being the centers of matched blocks. The output of this
scheme is a displacement field, which conveys local information about brain
symmetry or asymmetry. The points $\{b'_i\}$ are then flipped back with
respect to $K$, giving the points $\{b_i = S_K(b'_i)\}$; $b_i$ is the
counterpart of $a_i$ in the opposite side of the brain (see Fig. 4).

Fig. 4. The block matching. The point $b'_i$ in $S_K(I)$ is homologous to the
point $a_i$ in $I$; $b_i = S_K(b'_i)$ is the counterpart of $a_i$ in the other
hemisphere. $I$ contains $128^3$ voxels. The chosen parameters are:
$N = (32, 32, 32)$, $\Delta = (8, 8, 8)$, $\Omega = (8, 8, 8)$,
$\Sigma = (16, 16, 16)$. In $I$, the subset of $\{B\}$ is defined by the
dashed grid (parameters $\Delta$). Around the block of center $a_i$,
superposed on $S_K(I)$ with dotted lines, a neighborhood of research is
delimited (parameters $\Omega$). In this sub-image of $S_K(I)$, the search is
completed on the subset of $\{B'\}$ defined by the small dashed grid
(parameters $\Sigma$). For each of the $13^3 = 2197$ such defined blocks in
$I$, the search is done on $2^3 = 8$ blocks in $S_K(I)$.

Different intensity-based criteria can be chosen as a similarity measure, such
as the Correlation Coefficient (CC), the Correlation Ratio (CR) or the Mutual
Information (MI). Each of these measures assumes an underlying relationship
between the voxel intensities of the 2 images, respectively affine (CC),
functional (CR), or statistical (MI) [13]. In practice, the CR and the MI are
well suited to multimodal registration, whereas the CC is suited to monomodal
registration. In our case, $I$ and $S_K(I)$ have the same “modality”: an
affine, or locally affine, relationship can be assumed, and we use the CC.

This block matching approach, based on local similarity measures, makes it
possible to exclude very asymmetrical and meaningless areas from the
computation of the plane. First, if no block $B'$ in the subset defined in
$S_K(I)$ exhibits a high $|CC|$ with a given block $B$ in the subset defined
in $I$, its center is eliminated straightforwardly, by setting a convenient
threshold. In practice, this happens when the structures existing in one given
block in $I$ are absent from any block in $S_K(I)$, which is the case for
strongly asymmetrical areas. This elimination is not easily feasible in the
method of Section 2.1, where it is difficult to detect where the non-rigid
algorithm fails. Thus, the estimation step, performed with these preselected
inter-hemispheric correspondences, is mainly based on symmetrical areas. The
robust estimation technique we use (a least trimmed squares minimization)
makes it possible to exclude the remaining asymmetrical areas from the
computation of the plane.

Robust estimation of the mid-sagittal plane. A least trimmed squares
(LTS) strategy is used to find the plane $P$ best superposing the points
$\{a_i\}$ and their counterparts $\{b_i\}$. This minimization scheme has been
proven to be far more robust to outliers than the classical LS method. In our
problem, we have to deal with two kinds of outlying measures. First, aberrant
matchings can be obtained if the head is strongly tilted. Second, even after
the initial short-listing that eliminates blocks with low $|CC|$, blocks
conveying strong asymmetries can remain. This happens when a structure is
present in both hemispheres, but in different locations: the two matched
blocks containing this structure are likely to exhibit a high $|CC|$. The use
of a robust estimation technique enables the computed plane to be based only
on the underlying gross symmetry of the brain, the asymmetries being treated
as outliers. The LTS scheme we use is:

— The plane $P$ minimizing $\sum_i \|a_i - S_P(b_i)\|^2$ is computed (see
Appendix).

— The residuals $r_i = \|a_i - S_P(b_i)\|$ are trimmed, and $P$ is recomputed
as previously, using only the voxels $i$ with the 50% smallest residuals.

— After several iterations, the scheme stops when the angle between the normal
vectors of two successively estimated planes is lower than a fixed threshold;
we consider that they are “sufficiently close” to each other.
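The three steps above can be sketched as follows. The closed-form LS fit used here is our own derivation and not necessarily the one given in the paper's appendix: writing the plane as n·x = d with unit normal n, the criterion is minimized by d = n·G, where G is the mean of the pair midpoints, and by taking n as the eigenvector with the smallest eigenvalue of the symmetrized cross-covariance of the centered points. Function names and default tolerances are illustrative assumptions.

```python
import numpy as np

def fit_symmetry_plane(a, b):
    """LS plane n.x = d minimizing sum ||a_i - S_P(b_i)||^2 (our derivation)."""
    g = 0.5 * (a.mean(axis=0) + b.mean(axis=0))   # mean of the pair midpoints
    ac, bc = a - g, b - g
    m = ac.T @ bc
    m = m + m.T                                   # symmetric 3 x 3 matrix
    _, vecs = np.linalg.eigh(m)                   # eigenvalues in ascending order
    n = vecs[:, 0]                                # smallest eigenvalue
    return n, float(n @ g)

def reflect(points, n, d):
    """Mirror points about the plane n.x = d (n must be a unit vector)."""
    return points - 2.0 * (points @ n - d)[:, None] * n

def lts_symmetry_plane(a, b, keep=0.5, angle_tol_deg=0.01, max_iter=50):
    """Least trimmed squares loop: refit on the `keep` fraction of best pairs."""
    n, d = fit_symmetry_plane(a, b)
    h = max(3, int(np.ceil(keep * len(a))))
    for _ in range(max_iter):
        r = np.linalg.norm(a - reflect(b, n, d), axis=1)   # residuals r_i
        idx = np.argsort(r)[:h]                            # smallest residuals
        n_new, d_new = fit_symmetry_plane(a[idx], b[idx])
        angle = np.degrees(np.arccos(min(1.0, abs(float(n @ n_new)))))
        n, d = n_new, d_new
        if angle < angle_tol_deg:                          # planes close enough
            break
    return n, d
```

The pair (n, d) is only defined up to a common sign, so comparisons with a reference plane should allow for (-n, -d).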

This strategy is able to cope with up to 50% of outliers. To improve the
accuracy of the estimation, we iterate the process (Fig. 5). As previously
noted, after a first estimation $P_1$ of the mid-sagittal plane, the
transformation $R_1 = (S_K \circ S_{P_1})^{1/2}$ is such that $P_1 = K$ in
$R_1(I)$ (we recall that $K$ is fixed to the image grid). We make a new block
matching between $R_1(I)$ and $S_K(R_1(I))$, $K$ being the first estimated
plane $P_1$, and a new estimation $P_2$ by the LTS procedure. The
transformation $R_2 = (S_K \circ S_{P_2})^{1/2}$ is such that $P_2 = K$ in
$(R_2 \circ R_1)(I)$. After several iterations, the mid-sagittal plane $P_n$
is computed from the image $(R_{n-1} \circ \ldots \circ R_1)(I)$. The final
estimate is the plane $K$ in $(R_n \circ \ldots \circ R_1)(I)$. The
composition of the successively estimated rigid transformations $R_i$ avoids
multiple resampling. Usually, we choose a fixed number of iterations.
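The alignment step can be illustrated with homogeneous 4 x 4 matrices. The composition of the reflections about P and K is a rigid motion (a rotation about the line where P and K meet, or a translation if they are parallel); the rigid motion whose square equals it maps P onto K. The sketch below rests on that reading of the alignment transformation, uses hypothetical helper names, and assumes the angle between the two planes is well below 90 degrees.

```python
import numpy as np

def reflection(n, d):
    """4x4 homogeneous matrix of the reflection about the plane n.x = d."""
    n = np.asarray(n, dtype=float)
    scale = np.linalg.norm(n)
    n, d = n / scale, d / scale          # normalize the plane equation
    S = np.eye(4)
    S[:3, :3] -= 2.0 * np.outer(n, n)
    S[:3, 3] = 2.0 * d * n
    return S

def half_rigid(T):
    """Rigid motion H with H @ H == T (half rotation angle, same axis)."""
    R, t = T[:3, :3], T[:3, 3]
    angle = np.arccos(np.clip((np.trace(R) - 1.0) / 2.0, -1.0, 1.0))
    if angle < 1e-12:
        Rh = np.eye(3)                   # pure translation
    else:
        axis = np.array([R[2, 1] - R[1, 2],
                         R[0, 2] - R[2, 0],
                         R[1, 0] - R[0, 1]]) / (2.0 * np.sin(angle))
        K = np.array([[0.0, -axis[2], axis[1]],
                      [axis[2], 0.0, -axis[0]],
                      [-axis[1], axis[0], 0.0]])
        h = angle / 2.0                  # Rodrigues formula with half the angle
        Rh = np.eye(3) + np.sin(h) * K + (1.0 - np.cos(h)) * (K @ K)
    H = np.eye(4)
    H[:3, :3] = Rh
    H[:3, 3] = np.linalg.solve(Rh + np.eye(3), t)   # (Rh + I) th = t
    return H
```

Successive estimates can then be accumulated by multiplying the 4 x 4 matrices, so that the original image is resampled only once with the composed transformation.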



Fig. 5. General scheme. We describe the iterative process for one given choice
of parameters (i.e., at one given scale). Usually, we fix the number of
iterations: typically, $n_{max} = 5$ yields good results (see Section 3).
Multiscale scheme. Given a set of parameters $N$, $\Omega$, $\Delta$, $\Sigma$,
the complexity of the block matching process is proportional to
$(N_x N_y N_z)(\Omega_x \Omega_y \Omega_z) / [(\Delta_x \Delta_y \Delta_z)(\Sigma_x \Sigma_y \Sigma_z)]$. Intuitively,
when I is strongly tilted, I and Sk{I) are far from each other, and the neighbor- 
hood of research must be large (parameters E), to deal with strong differences 
in translation and rotation. We also expect large blocks (parameters N) to give 
more sensible CC than small ones. On the contrary, when / is already well alig- 
ned, we can restrict the neighborhood of research, and have more confidence in 
the CC computed on small blocks. 

We implemented a multiscale scheme to achieve a good trade-off between
accuracy and complexity. Initially, when the head is suspected to be strongly
tilted, we make a first estimation of the mid-sagittal plane with large values of
$N$, $\Omega$, $\Delta$, $\Sigma$. This raw estimate $P^1$, based on a displacement field with low density
and low resolution, is the center of $(R^1_n \circ \dots \circ R^1_1)(I)$ ($n$ is the number of iterations
at a given scale). Then, we decrease the parameters so that the complexity
remains constant: the new estimate $P^2$ is the center of $(R^2_n \circ \dots \circ R^2_1 \circ R^1_n \circ \dots \circ R^1_1)(I)$,
and so on. At the last scale, the estimation is based on a displacement field of
high density and high resolution, and is likely to be accurate. Usually, we make
the following choices, for isotropic as well as anisotropic images:

— The initial values of the parameters are:

- $N = ([X/4], [Y/4], [Z/4])$ or $N = ([X/8], [Y/8], [Z/8])$ (see Section 3)

- $\Omega = N$, $\Delta = \Sigma = N/4$

— At each iteration, they are automatically updated as follows:

$N \leftarrow N/2$, $\Omega \leftarrow \Omega/2$, $\Delta \leftarrow \Delta/2$, $\Sigma \leftarrow \Sigma/2$.

— The updating in the direction $x$ (resp. $y$, $z$) stops when $N_x$ (resp. $N_y$, $N_z$)
is smaller than 4 at the next scale. At this level, the small block size makes
the computed CC meaningless. The whole process stops when there
is no updating in any direction. For an image of size $128^3$ and for each of
the 2 choices we usually make for initial parameters, we get 4 and 3 scales
respectively, and $\Delta = \Sigma = (1, 1, 1)$ at the last scale: this means that we
obtain a displacement field of very high density and resolution.
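The update rule can be checked with a short script. The sketch below is illustrative only: the function name `multiscale_schedule` and its arguments are hypothetical, and the square brackets in the initial values are assumed to denote integer division.

```python
def multiscale_schedule(shape, divisor=4, min_block=4):
    """List the (N, Omega, Delta, Sigma) values used at each scale.

    shape:     image size (X, Y, Z) in voxels
    divisor:   4 or 8, the two initial choices N = shape/4 or N = shape/8
    min_block: a direction stops updating when its block size would drop below this
    """
    n = [s // divisor for s in shape]          # block size N
    omega = list(n)                            # search neighborhood, Omega = N
    delta = [max(1, v // 4) for v in n]        # Delta = N/4
    sigma = list(delta)                        # Sigma = N/4
    scales = []
    while True:
        scales.append({"N": tuple(n), "Omega": tuple(omega),
                       "Delta": tuple(delta), "Sigma": tuple(sigma)})
        updated = False
        for k in range(3):                     # update x, y, z independently
            if n[k] // 2 >= min_block:
                n[k] //= 2
                omega[k] //= 2
                delta[k] = max(1, delta[k] // 2)
                sigma[k] = max(1, sigma[k] // 2)
                updated = True
        if not updated:                        # no direction was updated: stop
            return scales

for divisor in (4, 8):
    schedule = multiscale_schedule((128, 128, 128), divisor)
    print(divisor, len(schedule), schedule[-1]["Delta"])
```

For a $128^3$ image this reproduces the counts quoted in the text: block sizes 32, 16, 8, 4 (four scales) or 16, 8, 4 (three scales), ending with a step of one voxel in every direction.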

3 Validation: Robustness and Accuracy Analysis 

3.1 Materials 

In this section, we present a series of experiments on simulated data, to show 
the robustness and the accuracy of the algorithm. Moreover, we aim at finding a 
set of optimal parameters for the computation of the plane and showing that the 
algorithm is robust with respect to a relatively high level of noise and bias. This 
simulated dataset contains 1152 synthetic MR images, generated as follows. 

First, a perfectly symmetrical image $I_1$ is created. We consider an original
MR image $I$ of size $256^3$, with voxel size 0.78 mm$^3$, provided by Dr. Neil Roberts,
Magnetic Resonance and Image Analysis Research Centre (University of Liverpool,
UK). Running our algorithm on very high resolution images implies a
prohibitive computation time; we resample $I$ to get a new image of size $128^3$. In
the latter, a mid-sagittal plane is determined by visual inspection, and matched
with the center of the image grid. One half of the brain is removed; the other
one is flipped with respect to the center, which is then the perfect symmetry plane
of this new image $I_1$, which constitutes the ground truth for our validation experiments.

Second, artificial lesions with different grey levels and local expansions and
shrinkings are added inside the brain to create strong focal asymmetries. Third,
an additive, stationary, Gaussian white noise ($\sigma = 3$) is added, on top of the
intrinsic noise in $I_1$. Fourth, a roll angle, a yaw angle and a translation along the
left-right axis are applied. We choose the angles in the set {0, 3, 6, ..., 21} (in degrees),
and the translations in the set {0, 4, 8, ..., 20} (in voxels): the $8 \times 8 \times 6 = 384$ possible
combinations constitute the dataset A; the applied noise is different for each image.
Resampling $I_1$ to the size $64^3$ gives the image $I_2$. Adding the same lesions and
deformations, random noise with the same characteristics, and applying the same
rotations, and translations of 0, 2, ..., 10 voxels, we get a second dataset (B) of
384 images (the transformation with parameters (yaw, roll, translation) = $(\alpha, \beta, 2t)$
applied to $I_1$ and the transformation $(\alpha, \beta, t)$ applied to $I_2$ are the same).
Finally, a strong multiplicative bias field (linear in $x$, $y$ and $z$) is added to $I_2$
before applying the transformation, which creates a third dataset (C) of 384 images. In brief:

— dataset A: $I_1$ + lesions + deformations + noise + 2 rotations + 1 translation

— dataset B: $I_2$ + lesions + deformations + noise + 2 rotations + 1 translation

— dataset C: $I_2$ + lesions + deformations + noise + bias + 2 rotations + 1
translation
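The size of the simulated dataset follows directly from the parameter grids. The short enumeration below is only a bookkeeping sketch (the variable and function names are hypothetical); it lists the (yaw, roll, translation) triples and does not generate any image.

```python
from itertools import product

ANGLES_DEG = range(0, 22, 3)                   # 0, 3, ..., 21 degrees: 8 values

def transformation_grid(translations):
    """All (yaw, roll, translation) triples for one dataset."""
    return [(yaw, roll, t) for yaw, roll, t in product(ANGLES_DEG, ANGLES_DEG, translations)]

dataset_a = transformation_grid(range(0, 21, 4))   # 128^3 image: 0, 4, ..., 20 voxels
dataset_b = transformation_grid(range(0, 11, 2))   # 64^3 image: 0, 2, ..., 10 voxels
dataset_c = list(dataset_b)                        # same grid as B, with bias added

print(len(dataset_a), len(dataset_b), len(dataset_c))
print(len(dataset_a) + len(dataset_b) + len(dataset_c))
```

Each triple (yaw, roll, 2t) of dataset A has a counterpart (yaw, roll, t) in datasets B and C, which is what makes the experiments at the two resolutions comparable.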



3.2 Methods 

The following experiments are devised (with $n_{max} = 5$ iterations at each scale):

— Experiment 1: dataset A with $(N, \Omega, \Delta, \Sigma) = (32, 32, 8, 8)$

— Experiment 2: dataset A with $(N, \Omega, \Delta, \Sigma) = (16, 16, 4, 4)$

— Experiment 3: dataset B with $(N, \Omega, \Delta, \Sigma) = (16, 16, 4, 4)$

— Experiment 4: dataset C with $(N, \Omega, \Delta, \Sigma) = (16, 16, 4, 4)$

For each experiment, the computed roll and yaw angles and the translation along the
left-right axis aligning the estimated mid-sagittal plane are compared with the
applied ones, giving a measure of accuracy of the algorithm. For this purpose,
the computed rigid transformation is composed with the applied one. The norm
of the yaw and roll angles of the rotation component of this composition and the
norm of its translation component along the left-right axis are computed; the
closer to zero these 3 parameters are, the more accurate the result is. Another
measure of accuracy, $\epsilon$, is described in Fig. 7. We consider that an experiment
is successful when $\epsilon$ is lower than a given threshold, typically 1 voxel. The
maximal value $\delta_{max}$ of $\delta$ (which measures the initial tilt of the head) for which
the algorithm succeeds gives an idea of the robustness of the algorithm.
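The comparison between applied and computed transformations can be illustrated as follows. This is a sketch under explicit assumptions that are not stated in the text: the left-right axis is taken to be x, yaw is taken as a rotation about z and roll as a rotation about y, and the function names are hypothetical.

```python
import numpy as np

def rigid(yaw_deg, roll_deg, tx):
    """4x4 rigid transform: Rz(yaw) @ Ry(roll), then a translation tx along x.

    Assumption: x is the left-right axis; the axis conventions are illustrative.
    """
    a, b = np.radians([yaw_deg, roll_deg])
    rz = np.array([[np.cos(a), -np.sin(a), 0.0],
                   [np.sin(a),  np.cos(a), 0.0],
                   [0.0,        0.0,       1.0]])
    ry = np.array([[ np.cos(b), 0.0, np.sin(b)],
                   [ 0.0,       1.0, 0.0      ],
                   [-np.sin(b), 0.0, np.cos(b)]])
    t = np.eye(4)
    t[:3, :3] = rz @ ry
    t[0, 3] = tx
    return t

def residual_errors(computed, applied):
    """Absolute yaw, roll (degrees) and left-right translation of computed @ applied."""
    r = computed @ applied
    yaw = np.degrees(np.arctan2(r[1, 0], r[0, 0]))
    roll = np.degrees(np.arctan2(-r[2, 0], r[2, 2]))
    return abs(yaw), abs(roll), abs(r[0, 3])

applied = rigid(6.0, 3.0, 4.0)
perfect = np.linalg.inv(applied)               # an ideal realignment undoes the tilt
print(residual_errors(perfect, applied))
```

A perfect realignment gives three residuals equal to zero up to rounding; any leftover tilt or offset of the estimated plane shows up directly in the three values.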
Fig. 6. Realignment of a synthetic MR image. Artificial lesions, local deformations,
noise and bias are added to a perfectly symmetrical MR image of size $128^3$.
Roll and yaw angles of 6 degrees, and translation along the left-right axis of 6 voxels
are applied to this image. The initial parameters of the block matching algorithm are
$(N, \Omega, \Delta, \Sigma) = (16, 16, 4, 4)$. The errors of the computed transform, compared to the
applied one are: 4.10“^ degrees (roll angle), 3.10”^ degrees (yaw angle), 10“® voxels 
(translation), and 2.10~® voxels (error e, see Fig. [^l. We display 2 panels with axial 
(left) and coronal (right) views. In each panel, from left to right, we have the original 
image with added lesions and deformations, the tilted image with added noise and bias, 
and the realigned and recentered image. 





Fig. 7. A measure of accuracy. A synthetic image $I$ is generated, in which the
central plane $P$ is the sought symmetry plane of the brain, as described in the text
(left sketch). We apply yaw and roll angles, and a translation along the left-right axis,
which yields a rigid transformation $R_1$. In $R_1(I)$, the real symmetry plane $R_1(P)$ is
no longer aligned with the center of the image grid (central sketch). The maximum
$\delta$ of the four distances $\delta_1$, $\delta_2$, $\delta_3$, $\delta_4$ measures the tilt of the head in $R_1(I)$ before we
run the algorithm. We estimate a symmetry plane $\hat{P}$ and a rigid transformation $R_2$ so
that $\hat{P}$ is displayed in the center of $R_2 \circ R_1(I)$ (right sketch). The estimated plane $\hat{P}$ is
generally different from the real one $R_2 \circ R_1(P)$. The maximum $\epsilon$ of the four distances
$\epsilon_1$, $\epsilon_2$, $\epsilon_3$, $\epsilon_4$ gives a good idea of the maximal error in the whole volume of the image.
Comparing experiments 1 and 2 (resp. 1 and 3) shows the influence of
the initial size of the blocks (resp. subsampling) on the accuracy, the robustness
and the computation time of the algorithm. This aims at indicating which set of
parameters is best adapted to real medical images. Comparing experiments
3 and 4 shows the sensitivity of the algorithm to bias effects. The experiments
were run on a standard PC (OS Linux), 450 MHz, 256 MBytes of RAM.

3.3 Results and Interpretation 

Experiment 1 vs 2. The algorithm proved to be highly robust in experiment 1.
It never failed when $\delta$ was lower than 51 voxels, which corresponds
(for example) to parameters (yaw, roll, translation) = (15, 15, 16), (18, 18, 8)
or (21, 21, 0). In real images, the tilt of the head is usually smaller. We noticed
that the convergence of the algorithm is the same for parameters $(\alpha, \beta, t)$ and
$(\beta, \alpha, t)$: the yaw and roll angles play symmetric roles. Note that this convergence
is not deterministic in our experiments, because the random noise is added separately
to each image of the datasets. Thus, the algorithm did not fail systematically
for more extreme parameters; for example, it succeeded for the parameters
(21, 21, 20). In experiment 2, the rate of success is significantly reduced: the
algorithm systematically succeeded only when $\delta$ was lower than 42 voxels, which
approximately corresponds to parameters (12, 12, 16), (15, 15, 18) or (18, 18, 0). The small initial
block size and the restricted search neighborhood explain why the algorithm
is unable to deal with strongly tilted heads. Compared to experiment 1, there is one
less scale to explore, and the average computation time is reduced, but it is still
prohibitive (about 34 min). The accuracy obtained is about the same as in
experiment 1. Thus, the set of initial parameters $N = ([X/4], [Y/4], [Z/4])$ seems
to be the best adapted at a given resolution of the image.

Experiment 1 vs 3. For these two datasets, studied with the optimal initial block
size, the robustness is, surprisingly, about the same. The subsampling does not
significantly reduce the efficiency of the algorithm, which can fail when $\delta$
exceeds 25 voxels, which corresponds to parameters (15, 15, 8), (18, 18, 4) or
(21, 21, 0), comparable with the parameters of experiment 1. The accuracy is
halved in experiment 3 compared to experiments 1 and 2, but remains
very high (see Table 1). Finally, the computation time is strongly reduced (by a
factor of 10). This suggests that highly subsampled images (from $256^3$ to $64^3$)
are enough for a satisfactory reorientation and recentering.

Experiment 3 vs 4. The algorithm is very robust with respect to a relatively 
high bias. This is an important feature of this local approach. Locally, the inten- 
sity variations are smaller than on the whole image, and the CC is still a sensible 
measure. The accuracy and the computation time are similar. 



Conclusion. We draw several conclusions from these experiments. The accuracy
is always very high when the algorithm succeeds. For a usual MR image of size
$256^3$, with voxel size 0.78 mm$^3$ and with an initial tilt $\delta$ lower than about
100 voxels, which corresponds to realistic conditions of $\delta = 50$ (resp. 25) voxels
in the subsampled image of size $128^3$ (resp. $64^3$), our algorithm is likely to
succeed. Using the subsampled image of size $64^3$, with (16, 16, 4, 4) as initial
parameters and 5 iterations at each scale, we reach a precision of about
$\epsilon = 10 \cdot 10^{-2} \times 4 \times 0.78 \approx 0.3$ mm (see Table 1 and Fig. 7) for successful experiments,
within a CPU time of about 3 minutes; the factor 4 converts voxels of the $64^3$
image into voxels of the original $256^3$ image. For strongly tilted images, an initial
alignment along the principal axes of the brain can be a useful preprocessing step.



Table 1. Validation on simulated data. The RMS errors (indicated for successful
experiments only) are measured in degrees for the angles and in voxels for the left-right
translation and the value $\epsilon$ (see Fig. 7). The errors are doubled between experiments
on $128^3$ and $64^3$ images, including for the translation and $\epsilon$ (the errors in voxels are
about the same, and the errors in mm are doubled for half resolution images).

      Robustness               Accuracy (RMS errors)                    CPU
Exp.  (delta_max)   Roll angle   Yaw angle   Translation   epsilon     time

1     51 voxels     4·10^-2      4·10^-2     5·10^-2       11·10^-2    45 min
2     42 voxels     3·10^-2      4·10^-2     5·10^-2       10·10^-2    34 min
3     25 voxels     11·10^-2     9·10^-2     6·10^-2       13·10^-2    3 min
4     25 voxels     11·10^-2     8·10^-2     7·10^-2       11·10^-2    3 min


4 Results and Acknowledgements 

In this section, we present results for real anatomical (MR, CT) and functional
(SPECT, PET) images. For each illustration, we present axial (top) and coronal
(bottom) views, for the initial 3D image (left) and the reoriented and recentered
version (right) (see Fig. 8). The MR image has been provided by Dr. Neil
Roberts, Magnetic Resonance and Image Analysis Research Centre (University
of Liverpool, UK), and is of size $256^3$, with voxel size 0.78 mm$^3$. The CT image
comes from the Radiology Research Imaging Lab (Mallinckrodt Institute of
Radiology, Saint Louis, Missouri, USA), and is of size 256 x 256 x 203, with voxel
size 0.6 mm$^3$. The SPECT image has been provided by Pr. Michael L. Goris,
Department of Nuclear Medicine (Stanford University Hospital, USA), and is
of size $64^3$. Finally, the PET image has been provided by the Hammersmith
Hospital in London, UK, and the Unité 230 of INSERM, Toulouse, France. It is
of size 128 x 128 x 15, with voxel size 2.05 mm x 2.05 mm x 6.75 mm.

5 Conclusion 

We have presented a new symmetry-based method to compute, reorient
and recenter the mid-sagittal plane in volumetric anatomical and functional images
of the brain. Our approach relies on the matching of homologous anatomical
structures or functional areas in both sides of the brain (or the skull), and a
robust estimation of the plane best superposing these pairs of counterparts. The
algorithm is iterative, multiscale, fully automated, and provides a useful tool for
further symmetry-based analysis of the brain. We showed on a large database
of synthetic images that we could obtain a subvoxel accuracy in a CPU time
of about 3 minutes for strongly tilted heads, noisy and biased images. We have
presented results on isotropic or anisotropic MR, CT, SPECT and PET images;
the method will be tested on functional MR and ultrasound images in the future.

Fig. 8. Results on real images. From left to right, top to bottom: isotropic MR,
CT, SPECT images, and anisotropic PET image. See Section 4 for details.

Appendix: LS Estimation of the Mid-Sagittal Plane 

We want to minimize $C = \sum_i \|a_i - S(b_i)\|^2$ with $S(b_i) = b_i - 2\,((b_i - p)^T n)\,n$,
where $p$ is a point in the plane and $n$ the unit normal vector to the plane. By
differentiating $C$ with respect to $p$, we get:

$\frac{\partial C}{\partial p} = 4 \sum_i (2p - b_i - a_i)^T n\, n^T$

which demonstrates that the barycenter $G = \frac{1}{m} \sum_i \frac{b_i + a_i}{2}$ (with $m$ the number of pairs) belongs to the plane.

Substituting $G$ in the first equation, we get:

$C = \sum_i (b_i - a_i)^2 + 4\,[(b_i - G)^T n]\,[(a_i - G)^T n]$

which is minimized when the following expression is minimized:

$n^T \left[ \sum_i (a_i - G)(b_i - G)^T \right] n$

which means that $n$ is the eigenvector associated with the smallest eigenvalue of
$\mathcal{I}$, where:

$\mathcal{I} = \sum_i (a_i - G)(b_i - G)^T$
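The intermediate algebra of this appendix is short enough to spell out. The block below is a worked expansion added for readability; it introduces no symbol beyond those of the appendix except $u_i$, a shorthand used only here.

```latex
% Shorthand (introduced here only): u_i = (b_i - p)^T n, a scalar.
\begin{align*}
a_i - S(b_i) &= (a_i - b_i) + 2\,u_i\,n, \qquad \|n\| = 1,\\
\|a_i - S(b_i)\|^2
  &= \|a_i - b_i\|^2 + 4\,u_i\,(a_i - b_i)^T n + 4\,u_i^2\\
  &= \|a_i - b_i\|^2 + 4\,u_i\,\bigl[(a_i - b_i)^T n + (b_i - p)^T n\bigr]\\
  &= \|a_i - b_i\|^2 + 4\,\bigl[(b_i - p)^T n\bigr]\bigl[(a_i - p)^T n\bigr].
\end{align*}
% With p = G, the n-dependent part of C is
%   4 n^T [ \sum_i (a_i - G)(b_i - G)^T ] n .
% Since n^T M n = n^T \tfrac{1}{2}(M + M^T) n for any square matrix M,
% a numerical implementation can take the eigenvector of the symmetric part.
```

The last remark matters in practice: the matrix $\sum_i (a_i - G)(b_i - G)^T$ is not symmetric in general, so a symmetric eigen-solver should be applied to its symmetric part, which defines the same quadratic form.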
Abstract. A probabilistic method for tracking 3D articulated human
figures in monocular image sequences is presented. Within a Bayesian 
framework, we define a generative model of image appearance, a robust 
likelihood function based on image graylevel differences, and a prior pro- 
bability distribution over pose and joint angles that models how humans 
move. The posterior probability distribution over model parameters is 
represented using a discrete set of samples and is propagated over time 
using particle filtering. The approach extends previous work on para- 
meterized optical flow estimation to exploit a complex 3D articulated 
motion model. It also extends previous work on human motion tracking 
by including a perspective camera model, by modeling limb self occlusion, 
and by recovering 3D motion from a monocular sequence. The explicit 
posterior probability distribution represents ambiguities due to image 
matching, model singularities, and perspective projection. The method 
relies only on a frame-to-frame assumption of brightness constancy and 
hence is able to track people under changing viewpoints, in grayscale 
image sequences, and with complex unknown backgrounds. 



1 Introduction 

We present a Bayesian approach to tracking 3D articulated human figures in mo- 
nocular video sequences. The human body is represented by articulated cylinders 
viewed under perspective projection. A generative model is defined in terms of 
the shape, appearance, and motion of the body, and a model of noise in the pixel 
intensities. This leads to a likelihood function that specifies the probability of 
observing an image given the model parameters. A prior probability distribution 
over model parameters depends on the temporal dynamics of the body and the 
history of body shapes and motions. With this likelihood function and temporal 
prior, we formulate the posterior distribution over model parameters at each 
time instant, given the observation history. 

The estimation of 3D human motion from a monocular sequence of 2D ima- 
ges is challenging for a variety of reasons. These include the non-linear dynamics 
of the limbs, ambiguities in the mapping from the 2D image to the 3D model, the 
similarity of the appearance of different limbs, self occlusions, kinematic singula- 
rities, and image noise. One consequence of these difficulties is that, in general, 
we expect the posterior probability distribution over model parameters to be
multi-modal. Also, we cannot expect to find an analytic, closed-form, expression 
for the likelihood function over model parameters. For these two reasons, we 
represent the posterior distribution non-parametrically using a discrete set of 
samples (i.e., states), where each sample corresponds to some hypothesized set 
of model parameters. Figure 1(a) illustrates this by showing a few samples from
such a distribution over 3D model parameters projected into an image. This 
distribution is propagated in time using a particle filter [um- 

The detection and tracking of human motion in video has wide potential for 
application in domains as diverse as animation and human-computer interaction. 
For this reason there has been a remarkable growth in research on this problem. 
The majority of proposed methods rely on sources of information such as skin 
color or known backgrounds which may not always be available. Such cues, 
while useful, are not intrinsic to 3D human motion. We focus, instead, on the 
3D motion of the figure and its projection into the image plane of the camera. 
This formulation, in terms of image motion, gives the tracker some measure of 
independence with respect to clothing, background clutter, and ambient lighting. 
Additionally, the approach does not require color images, nor does it require 
multiple cameras with different viewpoints. As a consequence, it may be used 
with archival movie footage and inexpensive video surveillance equipment. The 
use of perspective projection allows the model to handle significant changes in 
depth. Finally, unlike template tracking methods |^, the use of image motion 
allows tracking under changing viewpoint. These properties are illustrated with 
examples that include tracking people walking in cluttered images while their 
depth and orientation with respect to the camera changes significantly. 

2 Related Work 

Estimation of human motion is an active and growing research area |H| . We briefly 
review previous work on image cues, body representations, temporal models, and 
estimation techniques. 

Image Cues. Methods for full body tracking typically use simple cues such as 
background difference images 0 , color [221 or edges mum . However robust, 
these cues provide sparse information about the features in the image. Image 
motion (optical flow) [^tII 4124] provides a dense cue but, since it only exploits 
relative motion between frames, it is sensitive to the accumulation of errors over 
multiple frames. The result is that these techniques are prone to “drift” from 
the correct solution over time. The use of image templates ^ can avoid this 
problem, but such approaches are sensitive to changes in view and illumination. 
Some of the most interesting work to date has combined multiple cues such as 
edges and optical flow m- The Bayesian approach we describe may provide a 
framework for the principled combination of such cues. 

The approach here focuses on the estimation of 3D articulated motion from 
2D image changes. In so doing we exploit recent work on the probabilistic esti- 
mation of optical flow using particle filtering m- The method has been applied 
to non-linear spatial and temporal models of optical flow, and is extended here
to model the motion of articulated 3D objects. 

Body and Camera Models. Models of the human body vary widely in their 
level of detail. At one extreme are methods that crudely model the body as 
a collection of articulated planar patches 1 1 4l24j . At the other extreme are 3D 
models in which the limb shapes are deformable I3IS1- Additionally, assumptions 
about the viewing conditions vary from scaled orthographic projection |S| to full 
perspective \Z I WZ!r>\ . To account for large variations in depth, we model the body 
in terms of articulated 3D cylinders na viewed under perspective projection. 

Temporal Models. Temporal models of body limb or joint motion also vary in 
complexity; they include smooth motion 0, linear dynamical models m , non- 
linear models learned from training data using dimensionality reduction 
and probabilistic Hidden Markov Models (HMM’s) (e.g., 0). In many of these 
methods, image measurements are first computed and then the temporal models 
are applied to either smooth or interpret the results. For example, Leventon and 
Freeman m proposed a Bayesian framework for recovering 3D human motion 
from the motion of a 2D stick figure. They learned a prior distribution over 
human motions using vector quantization. Given the 2D motion of a set of joints, 
the most plausible 3D motion could be found. They required a pre-processing 
step to determine the 2D stick figure motion and did not tie the 3D motion 
directly to the image. Their Bayesian framework did not represent multi-modal 
distributions and therefore did not maintain multiple interpretations. 

Brand ^ learned a more sophisticated HMM from the same 3D training 
data used in m- Brand’s method used binary silhouette images to compute a 
feature vector of image moments. The hidden states of the HMM represented 3D 
body configurations and the method could recover 3D models from a sequence 
of feature vectors. These weak image cues meant that the tracking results were 
heavily dependent on the prior temporal model. 

Unlike the above methods, we explore the use of complex non-linear tempo- 
ral models early in the process to constrain the estimation of low-level image 
measurements. In related work Yacoob and Davis m used a learned “eigen- 
curve” model of image motion m to constrain estimation of a 2D articulated 
model. Black used similar non-linear temporal models within a probabilistic 
framework to constrain the estimation of optical flow. 

Estimation. Problems with articulated 3D tracking arise due to kinematic sin- 
gularities irzi> depth ambiguities, and occlusion. Multiple camera views, special 
clothing, and simplified backgrounds have been used to ameliorate some of these 
problems 02CS|. In the case of monocular tracking, body parts with low visi- 
bility (e.g. one arm and one leg) are often excluded from the tracking to avoid 
occlusion effects and also to lower the dimensionality of the model j^]. Cham 
and Rehg jS| avoid kinematic singularities and depth ambiguities by using a 2D 
model with limb foreshortening HZ|. They also employ a multi-modal tracking 
approach related to particle filtering. 
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Bregler and Malik 0 assumed scaled orthographic projection and posed the 
articulated motion problem as a linear estimation problem. Yamamoto et al. 
also formulated a linear estimation problem and relied on multiple camera views. 
These approaches elegantly modeled the image motion but did not account for 
imaging ambiguities and multiple matches. 

Recently, Deutscher et al. |7] showed promising results in 3D tracking of 
body parts using a particle filtering method (the Condensation algorithm). 
They successfully tracked an arm through kinematic singularities. We address 
the singularity problems in the same way but focus on image motion rather 
than edge tracking. We also employ learned temporal models to compensate for 
depth ambiguities and occlusion effects, and we show tracking results with more 
complex full-body motions. 



3 Generative Model 

A Bayesian approach to human motion estimation requires that we formulate a 
generative model of image appearance and motion. This model defines the state 
space representation for humans and their motion and specifies the probabilistic 
relationship between these states and observations. The generative model of 
human appearance described below has three main components, namely, shape, 
appearance, and motion. The human body is modeled as an articulated object, 
parameterized by a set of joint angles and an appearance function for each of 
the rigid parts. Given the camera parameters and the position and orientation 
of the body in the scene, we can render images of how the body is likely to 
appear. The probabilistic formulation of the generative model provides the basis
for evaluating the likelihood of observing image measurements, $I_t$, at time $t$, given
the model parameters.

3.1 Shape: Human Body Model 

As shown in Figure 1(b), the body is modeled as a configuration of 9 cylinders and
3 spheres, numbered for ease of identification. All cylinders are right-circular, 
except for the torso which has an elliptical cross-section. More sophisticated 
tapered cylinders IZEIl or superquadrics |H| could be employed. Each part is 
defined in a part-centric coordinate frame with the origin at the base of the 
cylinder (or sphere). Each part is connected to others at joints, the angles of 
which are represented as Euler angles. The origin in each part’s coordinate frame 
corresponds to the center of rotation (the joint position) . 

Rigid transformations, $T$, are used to specify relative positions and orientations
of parts and to change coordinate frames. We express them as homogeneous
transformation matrices:

$T = \begin{bmatrix} R_z R_y R_x & t \\ 0 & 1 \end{bmatrix}$

where $R_x$, $R_y$, and $R_z$ denote $3 \times 3$ rotation matrices about the coordinate axes,
with angles $\theta_x$, $\theta_y$ and $\theta_z$, and $t = [\tau_x, \tau_y, \tau_z]^T$ denotes the translation.
Fig. 1. (a) A few samples from a probability distribution over 3D model parameters
projected into the image coordinate system. (b) Human body model. Each limb, $i$, has
a local coordinate system with the $z_i$ axis directed along the limb. Joints have up to
3 angular DOF, expressed as rotations $(\theta_x, \theta_y, \theta_z)$.



A kinematic tree, with the torso at its root, is used to order the transformations
between the coordinate frames of different limbs. For example, in Figure
1(b), the point $P_1$ in the local coordinate system of limb 1 (the right thigh) can
be transformed to the corresponding point $P_g$ in the global coordinate system
as $P_g = T_{0,g} T_{1,0} P_1$. The global translation and rotation of the torso are
represented by $T_{0,g}$, while the translation and rotation of the right thigh with respect
to the torso are represented by $T_{1,0}$.
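The homogeneous transformations and their chaining along the kinematic tree can be sketched with NumPy. The function names and the numerical values below are hypothetical and chosen only to make the example checkable; the rotation order $R_z R_y R_x$ follows the matrix given above.

```python
import numpy as np

def rot_x(t):
    return np.array([[1, 0, 0], [0, np.cos(t), -np.sin(t)], [0, np.sin(t), np.cos(t)]])

def rot_y(t):
    return np.array([[np.cos(t), 0, np.sin(t)], [0, 1, 0], [-np.sin(t), 0, np.cos(t)]])

def rot_z(t):
    return np.array([[np.cos(t), -np.sin(t), 0], [np.sin(t), np.cos(t), 0], [0, 0, 1]])

def homogeneous(theta_x, theta_y, theta_z, translation):
    """4x4 matrix [[Rz Ry Rx, t], [0, 1]] with angles in radians."""
    T = np.eye(4)
    T[:3, :3] = rot_z(theta_z) @ rot_y(theta_y) @ rot_x(theta_x)
    T[:3, 3] = translation
    return T

# Hypothetical pose: torso rotated 90 degrees about z and lifted by 1 unit;
# right thigh frame offset by 0.1 along the torso's x axis, no relative rotation.
T_0g = homogeneous(0.0, 0.0, np.pi / 2, [0.0, 0.0, 1.0])   # torso -> global
T_10 = homogeneous(0.0, 0.0, 0.0, [0.1, 0.0, 0.0])         # thigh -> torso

P_1 = np.array([0.0, 0.0, 0.5, 1.0])    # point on the thigh axis, homogeneous coords
P_g = T_0g @ T_10 @ P_1                 # same point in global coordinates
print(P_g[:3])
```

Deeper limbs follow the same pattern: one extra matrix per joint on the path from the limb back to the torso.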

With these definitions, as shown in Figure 1(b), the entire pose and shape of
the body is given by 25 parameters, that is, angles at the shoulders, elbows, hips
and knees, and the position and orientation of the torso in the scene. Let $\phi$ be
the vector containing these 25 parameters.

Camera Model. The geometrical optics are modeled as a pinhole camera,
with a transformation matrix $T_c$ defining the 3D orientation and position of
a 3D camera-centered coordinate system with a focal length $f$ and an image
center $\mathbf{c} = [x_c, y_c]^T$. The matrix maps points in scene coordinates to points in
camera coordinates. Finally, points in 3D camera coordinates $(X, Y, Z)$ are projected onto
the image at locations, $\mathbf{x} = [x, y]^T$, given by $\mathbf{x} = \mathbf{c} - f\,[X/Z,\; Y/Z]^T$.
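A minimal sketch of this projection is given below. It assumes, as an illustration only, that the projection divides the first two camera coordinates by the third; the function name `project` and the numbers are hypothetical.

```python
import numpy as np

def project(point_scene, T_c, f, c):
    """Pinhole projection of a 3D scene point to image coordinates.

    T_c maps homogeneous scene coordinates to camera coordinates; the image
    location is c - f * [X/Z, Y/Z] (note the sign convention of the text).
    """
    X, Y, Z, _ = T_c @ np.append(np.asarray(point_scene, dtype=float), 1.0)
    return np.asarray(c, dtype=float) - f * np.array([X / Z, Y / Z])

T_c = np.eye(4)                      # camera frame coincides with the scene frame
print(project([1.0, 2.0, 4.0], T_c, f=2.0, c=[10.0, 20.0]))
```

Because of the division by depth, moving the point twice as far from the camera halves its offset from the image center, which is how the model accounts for changes in depth.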

3.2 Appearance Model 

For generality, we assume that each limb is texture mapped with an appearance
model, $R$. There are many ways in which one might specify such a model,
including the use of low-dimensional linear subspaces m- Moreover, it is desi- 
rable, in general, to estimate the appearance parameters through time to reflect 
the changing appearance of the object in the image. Here we use a particularly
simple approach in which the appearance function at time $t$ is taken to be the
mapping, $M(\cdot)$, of the image at time $t-1$ onto the 3D shape given by the shape
parameters at time $t-1$:

$R_t = M(I_{t-1}, \phi_{t-1})$ . (1)
In probabilistic terms, this means that the probability distribution over
appearance functions at time $t$, conditioned on past shapes $\vec{\phi}_{t-1} = [\phi_{t-1}, \dots, \phi_0]$,
past image observations, $\vec{I}_{t-1} = [I_{t-1}, \dots, I_0]$, and past appearance functions
$\vec{R}_{t-1} = [R_{t-1}, \dots, R_0]$, is given by

$p(R_t \mid \vec{\phi}_{t-1}, \vec{I}_{t-1}, \vec{R}_{t-1}) = p(R_t \mid I_{t-1}, \phi_{t-1}) = \delta(R_t - M(I_{t-1}, \phi_{t-1}))$ , (2)

where $\delta(\cdot)$ is a Dirac delta function.

Our generative model of the image, $I_t$, at time $t$ is then the projection of the
human model (shape and appearance) corrupted by noise:

$I_t(x_j) = M^{-1}(R_t, \phi_t, x_j) + \eta$ , (3)

where $M^{-1}(R_t, \phi_t, x_j)$ maps the 3D model of limb $j$ to image location $x_j$ and
$I_t(x_j)$ is the image brightness at pixel location $x_j$. To account for "outliers", the
noise, $\eta$, is taken to be a mixture of a Gaussian and a uniform distribution

$p_\eta(\eta; x_j, \phi_t) = (1 - \epsilon)\, G(\sigma(\alpha(x_j, \phi_t))) + \epsilon\, c$ ,

where $0 < \epsilon < 1$ and $c = 1/256$. The uniform noise is bounded over a finite
interval of intensity values while $G(\cdot)$ is a zero-mean normal distribution, the variance
of which may change with spatial position. In general, the variance is sufficiently
small that the area of the Gaussian outside the bounded interval may be ignored.

The prediction of image structure, $I_t$, given an appearance model, $R_t$,
estimated from the image at time $t-1$ will be less reliable in limbs, or regions
of limbs, that are viewed obliquely compared with those that are nearly fronto- 
parallel. In these regions, the image structure can change greatly from one frame 
to the next due to perspective distortions and self occlusion. This is captured 
by allowing the variance to depend on the orientation of the model surface. 

Let $\alpha(x_j, \phi_t)$ be a function that takes an image location, $x_j$, and projects
it onto a 3D limb position $P$ and returns the angle between the surface normal
at the point $P$ and the vector from $P$ to the focal point of the camera. The
variance of the Gaussian component of the noise is then defined with respect to
the expected image noise, $\sigma_I$, which is assumed constant, and $\alpha(x_j, \phi_t)$:

$\sigma^2(\alpha(x_j, \phi_t)) = (\sigma_I / \cos(\alpha(x_j, \phi_t)))^2$ . (4)
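The mixture noise model and the orientation-dependent variance of Equation (4) can be evaluated as below. This is an illustrative sketch: the function names are hypothetical, and the values used for $\sigma_I$ and $\epsilon$ in the example are arbitrary, since the text above does not fix them.

```python
import math

def sigma_of_alpha(sigma_i, alpha):
    """Standard deviation from Eq. (4): sigma_I / cos(alpha), alpha in radians."""
    return sigma_i / math.cos(alpha)

def noise_density(eta, alpha, sigma_i, eps, c=1.0 / 256.0):
    """Mixture density (1 - eps) * Gaussian(eta; 0, sigma(alpha)) + eps * c."""
    s = sigma_of_alpha(sigma_i, alpha)
    gauss = math.exp(-0.5 * (eta / s) ** 2) / (math.sqrt(2.0 * math.pi) * s)
    return (1.0 - eps) * gauss + eps * c

# Arbitrary illustrative values: sigma_I = 5 grey levels, eps = 0.1.
frontal = noise_density(20.0, 0.0, sigma_i=5.0, eps=0.1)    # fronto-parallel surface
oblique = noise_density(20.0, 1.2, sigma_i=5.0, eps=0.1)    # obliquely viewed surface
print(frontal, oblique)
```

A brightness difference of 20 grey levels is far more probable on the obliquely viewed surface than on the fronto-parallel one, and the uniform term keeps the density of very large residuals from vanishing, which is what makes the likelihood robust to outliers.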



3.3 Temporal Dynamics 

Finally we must specify the temporal dynamics as part of the generative model.
Towards this end we parameterize the motion of the shape in terms of a vector
of velocities, $V_t$, whose elements correspond to temporal derivatives of the shape
and pose parameters in $\phi$. Furthermore, we assume a first-order Markov model
on shape and velocity. Let the entire history of the shape and motion parameters
up to time $t$ be denoted by $\vec{\phi}_t = [\phi_t, \dots, \phi_0]$ and $\vec{V}_t = [V_t, \dots, V_0]$. Then, the
temporal dynamics of the model are given by

$p(\phi_t \mid \vec{\phi}_{t-1}, \vec{V}_{t-1}) = p(\phi_t \mid \phi_{t-1}, V_{t-1})$ , (5)

$p(V_t \mid \vec{\phi}_{t-1}, \vec{V}_{t-1}) = p(V_t \mid V_{t-1})$ . (6)

Humans move in a variety of complex ways, depending on the activity or gestures
being made. Despite this complexity, the movements are often predictable.
In Section 1^ we explore two specific models of human motion. The first is a sim- 
ple, general model of constant angular velocity. The second is an activity-specific 
model of walking. 

4 Bayesian Formulation 

The goal of tracking a human figure can now be formulated as the computation
of the posterior probability distribution over the parameters of the generative
model at time $t$, given a sequence of images, $\vec{I}_t$; i.e.,
$p(\phi_t, V_t, R_t \mid \vec{I}_t)$. This can be expressed as a marginalization of the
joint posterior over all states up to time $t$ given all images up to time $t$:

$$p(\phi_t, V_t, R_t \mid \vec{I}_t) = \int p(\vec{\phi}_t, \vec{V}_t, \vec{R}_t \mid \vec{I}_t) \, d\vec{\phi}_{t-1} \, d\vec{V}_{t-1} \, d\vec{R}_{t-1} . \qquad (7)$$

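Using Bayes’ rule and the Markov assumptions above, it can be shown that
the dependence on states at times before time $t-1$ can be removed, to give

$$p(\phi_t, V_t, R_t \mid \vec{I}_t) = \kappa \, p(I_t \mid \phi_t, V_t, R_t) \int \left[ p(\phi_t, V_t, R_t \mid \phi_{t-1}, V_{t-1}, R_{t-1}, \vec{I}_{t-1}) \, p(\phi_{t-1}, V_{t-1}, R_{t-1} \mid \vec{I}_{t-1}) \right] d\phi_{t-1} \, dV_{t-1} \, dR_{t-1} \qquad (8)$$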

where $\kappa$ is a normalizing constant that does not depend on the state variables.
Here, $p(I_t \mid \phi_t, V_t, R_t)$, which we refer to as the “likelihood,” is the
probability of observing the image at time $t$, given the shape, motion and appearance
states at time $t$. The integral in (8) is referred to as a temporal prior, or a
prediction, as it is equivalent to the probability over states at time $t$ given the
image measurement history; i.e., $p(\phi_t, V_t, R_t \mid \vec{I}_{t-1})$. It is useful
to understand the integrand as the product of two terms; these are the posterior
probability distribution over states at the previous time,
$p(\phi_{t-1}, V_{t-1}, R_{t-1} \mid \vec{I}_{t-1})$, and the dynamical process that
propagates this distribution over states from time $t-1$ to time $t$.

Before turning to the computation of the posterior in (8), it is useful to simplify
it using the generative model described above. For example, the likelihood
of observing the image at time $t$ does not depend on the velocity $V_t$, and
therefore $p(I_t \mid \phi_t, V_t, R_t) = p(I_t \mid \phi_t, R_t)$. Also, the probability
distribution over the state variables at time $t$, conditioned on those at time $t-1$,
can be factored further. This is based on the generative model, and the assumption
that the evolution of velocity and shape from time $t-1$ to $t$ is independent of the
evolution of appearance. This produces the following factorization:

$$p(\phi_t, V_t, R_t \mid \phi_{t-1}, V_{t-1}, R_{t-1}, \vec{I}_{t-1}) = p(\phi_t \mid \phi_{t-1}, V_{t-1}) \, p(V_t \mid V_{t-1}) \, p(R_t \mid \vec{I}_{t-1}, \phi_{t-1}) .$$






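Finally, these simplifications, taken together, produce the posterior distribution

$$p(\phi_t, V_t, R_t \mid \vec{I}_t) = \kappa \, p(I_t \mid \phi_t, R_t) \int \left[ p(\phi_t \mid \phi_{t-1}, V_{t-1}) \, p(V_t \mid V_{t-1}) \, p(R_t \mid \vec{I}_{t-1}, \phi_{t-1}) \, p(\phi_{t-1}, V_{t-1}, R_{t-1} \mid \vec{I}_{t-1}) \right] d\phi_{t-1} \, dV_{t-1} \, dR_{t-1} . \qquad (9)$$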

4.1 Stochastic Optimization 

Computation of the posterior distribution is difficult due to the nonlinearity of
the likelihood function over model parameters. This is a consequence of
self-occlusions, viewpoint singularities, and matching ambiguities. While we cannot
derive an analytic expression for the likelihood function over the parameters of
the entire state space, we can evaluate the likelihood of observing the image given
a particular state $(\phi_t, V_t, R_t)$; the computation of this likelihood is described
in Section 5.

Representation of the posterior is further complicated by the use of a non- 
linear dynamical model of the state evolution as embodied by the temporal prior. 
While we cannot assume that the posterior distribution will be Gaussian, or 
even unimodal, robust tracking requires that we maintain a representation of the 
entire distribution and propagate it through time. For these reasons we represent 
the posterior as a weighted set of state samples, which are propagated using a 
particle filter with sequential importance sampling. Here we briefly describe the 
method (for foundations see [11, 13], and for applications to 2D image tracking
with non-linear temporal models see m).

Each state, $s_t$, is represented by a vector of parameter assignments,
$s_t = [\phi_t, V_t]$. Note that in the current formulation we can drop the appearance
model $R_t$ from the state as it is completely determined by the shape parameters
and the images. The posterior at time $t-1$ is represented by $N$ state samples
($N \approx 10^4$ in our experiments). To compute the posterior (9) at time $t$ we first
draw $N$ samples according to the posterior probability distribution at time $t-1$.
For each state sample from time $t-1$, we compute $R_t$ given the generative
model. We propagate the angular velocities forward in time by sampling from
the prior $p(V_t \mid V_{t-1})$. Similarly, the shape parameters are propagated by
sampling from $p(\phi_t \mid \phi_{t-1}, V_{t-1})$. At this point we have new values of
$\phi_t$ and $R_t$ which can be used to compute the likelihood $p(I_t \mid \phi_t, R_t)$.
The $N$ likelihoods are normalized to sum to one and the resulting set of samples
approximates the posterior distribution $p(\phi_t, V_t, R_t \mid \vec{I}_t)$ at time $t$.
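As a rough illustration of the sampling scheme just described, the following Python sketch performs one time step of such a particle filter. The `State` encoding and the `propagate` and `likelihood` callables are hypothetical stand-ins for the temporal prior and the image likelihood; the fallback to uniform weights is likewise an assumption of this sketch rather than a detail taken from the paper.

```python
import random
from typing import Callable, List, Sequence, Tuple

State = Tuple[float, ...]  # hypothetical flat encoding of [phi_t, V_t]


def particle_filter_step(
    samples: Sequence[State],
    weights: Sequence[float],
    propagate: Callable[[State, random.Random], State],
    likelihood: Callable[[State], float],
    rng: random.Random,
) -> Tuple[List[State], List[float]]:
    """One time step: resample from the old posterior, predict, then reweight."""
    n = len(samples)
    # 1. Draw N samples according to the posterior weights at time t-1.
    drawn = rng.choices(samples, weights=weights, k=n)
    # 2. Propagate each sample through the temporal prior (the dynamics).
    predicted = [propagate(s, rng) for s in drawn]
    # 3. Evaluate the image likelihood of every predicted state.
    raw = [likelihood(s) for s in predicted]
    total = sum(raw)
    if total <= 0.0:
        # Assumption: if no sample explains the image, fall back to uniform weights.
        return predicted, [1.0 / n] * n
    # 4. Normalize so that the weights sum to one.
    return predicted, [w / total for w in raw]


def expected_state(samples: Sequence[State], weights: Sequence[float]) -> State:
    """Weighted mean of the samples, i.e. the expected value of the parameters."""
    dim = len(samples[0])
    return tuple(sum(w * s[k] for s, w in zip(samples, weights)) for k in range(dim))
```

Calling `particle_filter_step` once per frame and feeding its output back in as the next input propagates the sample set through time; `expected_state` gives the weighted mean that can be used for display.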



5 Likelihood Computation 

The likelihood $p(I_t \mid \phi_t, R_t)$ is the probability of observing image $I_t$
given that the human model has configuration $\phi_t$ and appearance $R_t$ at time $t$.
To compare the image, $I_t$, with the generative model, the model must be projected
into the image plane of the camera as described in Section 0. To reduce the influence
of camera noise on the matching, the images, $I_t$, are smoothed by a Gaussian filter
with a standard deviation of …. This has the effect of smoothing the likelihood
function over model parameters and hence the posterior distribution.

Fig. 2. Planar approximation of limbs improves efficiency.

Projection. The projection of limb surface points into the image plane and vice
versa is computationally expensive. Given the stochastic sampling framework,
this operation is performed many times and hence we seek an efficient approximation.
To simplify the projection onto the image, we first project the visible
portion of the cylindrical surface onto a planar patch that bisects the cylinder,
as shown in Figure 2. The projection of the appearance of a planar patch into
the image can be performed by first projecting the corners of the patch via
perspective projection. The projection of other limb points is given by interpolation.
This approximation speeds up the likelihood computation significantly.

Recall that the variance in the generative model (4) depends on the angle,
$\alpha(\mathbf{x}_j, \phi_t)$, between the surface normal and the optical axis of the
camera. With the planar approximation, $\alpha_j$ becomes the angle between the image
plane and the Z axis of limb $j$.

Likelihood Model. Given the generative model we define the likelihood of each
limb $j$ independently. We sample, with replacement, $i = 1 \ldots n$ pixel locations,
$\mathbf{x}_{j,i}$, uniformly from the projected region of limb $j$. According to (3),
the gray-value differences between points on the appearance model and the
corresponding image values are independent and are modeled as a mixture of a zero-mean
normal distribution and a uniform outlier distribution. We expect outliers, or
unmatched pixels, to result from occlusion, shadowing, and wrinkled clothing.

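The image likelihood of limb $j$ is then expressed as:

$$p_{image} = \frac{\epsilon}{256} + \frac{1 - \epsilon}{\sqrt{2\pi}\,\sigma(\alpha_j)} \exp\left( - \sum_{i=1}^{n} \frac{(I_t(\mathbf{x}_{j,i}) - \hat{I}_t(\mathbf{x}_{j,i}))^2}{2\sigma^2(\alpha_j)} \right) , \qquad (10)$$

where $\hat{I}_t(\mathbf{x}_{j,i})$ is the gray value that the generative model $M$
predicts at $\mathbf{x}_{j,i}$.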

The likelihood must also account for occlusion, which results from the depth
ordering of the limbs or from the surface orientation. To model occluded regions
we introduce the constant probability, $p_{occluded}$, that a limb is occluded;
$p_{occluded}$ is currently determined empirically.

To determine self occlusion in the model configuration $\phi_t$, the limbs are
ordered according to the shortest distance from the limb surface to the image
plane, using the camera parameters and $\phi_t$. Limbs that are totally or partly
covered by other limbs with lower depth are defined as occluded. This occlusion
detection is sub-optimal and could be refined so that portions of limbs can be 
defined as occluded (cf. pT|L 

Similarly, when the limb is viewed at narrow angles (all visible surface normals
are roughly perpendicular to the viewing direction), the linearized limb shape
formulation makes the appearance pattern highly distorted. In this case, the
limb can be thought of as occluding itself.

We then express the likelihood as a mixture between $p_{image}$ and the constant
probability of occlusion, $p_{occluded}$. The visibility $q$ (i.e., the influence of the
actual image measurement) decreases with the increase of the angle $\alpha_j$ between
the limb $j$ principal axis and the image plane. When the limb is exactly perpendicular
to the image plane, it is by this definition considered occluded. The expression
for the image likelihood of limb $j$ is defined as:

$$p_j = q(\alpha_j) \, p_{image} + (1 - q(\alpha_j)) \, p_{occluded} \qquad (11)$$

where $q(\alpha_j) = \cos(\alpha_j)$ if limb $j$ is non-occluded, or 0 if limb $j$ is occluded.

According to the generative model, the appearances of the limbs are independent
and the likelihood of observing the image given a particular body pose is
given by the product of the limb likelihoods:

$$p(I_t \mid \phi_t, R_t) = \prod_j p_j . \qquad (12)$$
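The occlusion mixture for a single limb and the product over limbs can be sketched as follows. The image term `p_image` is taken as an input here; clamping the cosine at zero and working with logarithms are choices of this sketch, not steps stated in the text.

```python
import math
from typing import Sequence


def visibility(alpha: float, occluded: bool) -> float:
    """q(alpha_j): cos(alpha_j) for a visible limb, 0 for an occluded one."""
    # Assumption: negative cosines are clamped to zero.
    return 0.0 if occluded else max(0.0, math.cos(alpha))


def limb_likelihood(p_image: float, alpha: float, occluded: bool,
                    p_occluded: float) -> float:
    """Mixture of the image term and the constant occlusion probability."""
    q = visibility(alpha, occluded)
    return q * p_image + (1.0 - q) * p_occluded


def log_pose_likelihood(limb_likelihoods: Sequence[float]) -> float:
    """Log of the product of limb likelihoods; logs avoid numerical underflow."""
    return sum(math.log(p) for p in limb_likelihoods)
```

For a fully visible, fronto-parallel limb the mixture reduces to `p_image`, and for an occluded limb it reduces to `p_occluded`.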



6 Temporal Model 

The temporal model encodes information about the dynamics of the human 
body. Here it is formulated as a prior probability distribution and is used to 
constrain the sampling to portions of the parameter space that are likely to cor- 
respond to human motions. General models such as constant acceleration can 
account for arbitrary motions but do not constrain the parameter space greatly. 
For a constrained activity such as walking or running we can construct a tempo- 
ral model with many fewer degrees of freedom which makes the computational 
problem more tractable. Both types of models are explored below. 

6.1 Generic Model: Smooth Motion 

The smooth motion model assumes that the angular velocity of the joints and the
velocity of the body are constant over time. Recall that the shape parameters
are given by $\phi_t = [\tau^g_t, \theta^g_t, \theta^l_t]$ where $\tau^g_t$ and
$\theta^g_t$ represent the translation and rotation that map the body into the world
coordinate system and $\theta^l_t$ represents the relative angles between all pairs of
connected limbs. Let $V_t = [\dot{\tau}^g_t, \dot{\theta}^g_t, \dot{\theta}^l_t]$
represent the corresponding velocities. The physical limits of human movement
are modeled as hard constraints on the individual quantities such that
$\phi_t \in [\phi_{min}, \phi_{max}]$.
Fig. 3. Learning a walking model. (a) Joint angles of different people walking were
acquired with a motion capture system. Curves are segmented into walking cycles
manually and an eigenmodel of the cycle is constructed. (b) Mean angle of left knee as
a function of time. (c) First three eigenmodes of the left knee $B_j$, $j \in [1, 3]$,
scaled by their respective variance $\lambda_j$. (1 = solid, 2 = dashed, 3 = dotted.)



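Our smooth motion model assumes that all elements $\phi_{i,t} \in \phi_t$ and
$V_{i,t} \in V_t$ are independent. The dynamics are represented by

$$p(\phi_{i,t} \mid \phi_{i,t-1}, V_{i,t-1}) = \begin{cases} G(\phi_{i,t} - (\phi_{i,t-1} + V_{i,t-1}), \sigma^\phi_i) & \text{if } \phi_{i,t} \in [\phi_{i,min}, \phi_{i,max}] \\ 0 & \text{otherwise} \end{cases}$$

$$p(V_{i,t} \mid V_{i,t-1}) = G(V_{i,t} - V_{i,t-1}, \sigma^V_i) ,$$

where $G(x, \sigma)$ denotes a Gaussian distribution with zero mean and standard
deviation $\sigma$, evaluated at $x$. The standard deviations $\sigma^\phi_i$ and
$\sigma^V_i$ are empirically determined. The joint angles of heavy limbs typically have
lower standard deviations than those in lighter limbs.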
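Sampling from a prior of this kind can be sketched in Python as below. The sketch assumes the angle is predicted as the previous angle plus the previous velocity, and it realizes the hard joint limits by rejection sampling; the function names, the `max_tries` cap, and the clamping fallback are illustrative assumptions.

```python
import random


def sample_angle(phi_prev: float, v_prev: float, sigma_phi: float,
                 phi_min: float, phi_max: float, rng: random.Random,
                 max_tries: int = 1000) -> float:
    """Draw a new angle around phi_prev + v_prev, restricted to the joint limits.

    Proposals outside [phi_min, phi_max] have zero probability under the
    hard constraint, so they are simply discarded (rejection sampling).
    """
    mean = phi_prev + v_prev
    for _ in range(max_tries):
        phi = rng.gauss(mean, sigma_phi)
        if phi_min <= phi <= phi_max:
            return phi
    # Assumption: if the limits are far from the mean, clamp to the nearest limit.
    return min(max(mean, phi_min), phi_max)


def sample_velocity(v_prev: float, sigma_v: float, rng: random.Random) -> float:
    """Constant-velocity assumption: previous velocity plus Gaussian noise."""
    return rng.gauss(v_prev, sigma_v)
```

Each parameter is sampled independently, so a full state is propagated by applying these two functions element by element.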

This model works well for tracking individual body parts that are relatively 
low dimensional. This is demonstrated in Section 7 for tracking arm motion
(cf. 0 ). This is a relatively weak model for constraining the motion of the 
entire body given the current sampling framework and limited computational 
resources. In general, one needs a variety of models of human motion and a 
principled mechanism for choosing among them. 

6.2 Action Specific Model: Walking 

In order to build stronger models, we can take advantage of the fact that many 
human activities are highly constrained and the body is often moved in symme- 
tric and repetitive patterns. In what follows we consider the example of walking 
motion. 

Training data corresponding to the 3D model parameters was acquired with
a commercial motion capture system. Some of the data are illustrated in Figure 3.
From the data, $m = 13$ example walking cycles from 4 different subjects
(professional dancers) were segmented manually and scaled to the same length.
These cycles are then used to train a walking model using Multivariate Principal
Component Analysis (MPCA) [3, 19, 23]. In addition to the joint angles, we model
the speed, $v_{i,\mu}$, of the torso in the direction of the walking motion $i$. This
speed, $v_{i,\mu}$, at time step $\mu$ in the cycle is
$\| \tau^g_{i,\mu+1} - \tau^g_{i,\mu} \|$. The curves corresponding to
the speed of the torso and the relative angles of the limbs, $\phi^l_i$, are concatenated
forming column vectors $A_i$ for each training example $i = 1 \ldots m$. The mean
vector $\hat{A}$ is subtracted from all examples: $A_i = A_i - \hat{A}$. Since the
walking speed [m/frame] and the joint angles [radians] have approximately the same
scales they need not be rescaled before applying MPCA.

The eigenvalues $\lambda_j$ and eigenvectors $B_j$, $j \in [1, m]$, of the matrix
$A = [A_1, \ldots, A_m]$ are now computed from $A = B \Sigma D^T$ using Singular Value
Decomposition (SVD), where $B = [B_1, \ldots, B_m]$ and $\Sigma$ is a diagonal matrix
with $\lambda_j$ along the diagonal. The eigenvectors represent the principal modes of
variation in the training set, while the eigenvalues reflect the variance of the
training set in the direction of the corresponding eigenvector. The eigenvectors $B_j$
can be viewed as a number of eigencurves, one for each joint, stacked together.
Figure 3(c) shows three eigencurves corresponding to the left knee walking cycle.

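The smallest number $d$ of eigenvectors $B_j$ such that the fraction of the total
variance they capture, $\sum_{j=1}^{d} \lambda_j / \sum_{j=1}^{m} \lambda_j$, exceeds
0.95 is selected; in our case $d = 5$. With $B = [B_1, \ldots, B_d]$ we can, with $d$
parameters $c = [c_1, \ldots, c_d]^T$, approximate a synthetic walking cycle $A^*$ as:

$$A^* = \hat{A} + B c . \qquad (13)$$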
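Selecting the number of modes and synthesizing a cycle from the eigenmodel can be sketched as follows. The sketch assumes the eigenvalues are sorted in decreasing order and that the selection criterion is the fraction of the total they capture; the function names and the list-of-eigenvectors layout are assumptions made for illustration.

```python
from typing import List, Sequence


def choose_num_modes(eigenvalues: Sequence[float], fraction: float = 0.95) -> int:
    """Smallest d whose leading eigenvalues capture more than `fraction` of the total."""
    total = sum(eigenvalues)
    running = 0.0
    for d, lam in enumerate(eigenvalues, start=1):
        running += lam
        if running / total > fraction:
            return d
    return len(eigenvalues)


def synthesize_cycle(mean: Sequence[float], basis: Sequence[Sequence[float]],
                     c: Sequence[float]) -> List[float]:
    """Mean cycle plus a weighted sum of eigenvectors (basis[k] is the k-th one)."""
    cycle = list(mean)
    for coeff, vec in zip(c, basis):
        for k, value in enumerate(vec):
            cycle[k] += coeff * value
    return cycle
```

Setting all coefficients in `c` to zero reproduces the mean walking cycle; varying them moves the synthetic cycle along the learned modes of variation.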

The set of independent parameters is now $\{c_t, \mu_t, \tau^g_t, \theta^g_t\}$ where
$\mu_t$ denotes the current position (or phase) in the walking cycle. Thus, this model
reduces the original 25-dimensional parameter space, $\phi$, to a 12-dimensional space.

Recall that the global translation and rotation, $\tau^g_t$, $\theta^g_t$, can be
expressed as a homogeneous transformation matrix $T$. We also define $v_{t-1}$ to be
the learned walking speed at time $t-1$. The parameters are propagated in time as:

$$p(c_t \mid c_{t-1}) = G(c_t - c_{t-1}, \sigma^c I_d) \qquad (14)$$

$$p(\mu_t \mid \mu_{t-1}) = G(\mu_t - \ldots) \qquad (15)$$

$$p(\tau^g_t \mid T_{t-1}, c_{t-1}) = G([\tau^g_t, 1]^T - T_{t-1} [v_{t-1} \ 0 \ 0 \ 1]^T, \sigma^\tau I_3) \qquad (16)$$

$$p(\theta^g_t \mid \theta^g_{t-1}) = G(\theta^g_t - \theta^g_{t-1}, \sigma^\theta I_3) \qquad (17)$$

where $\sigma^\mu$, $\sigma^\tau$ and $\sigma^\theta$ represent the empirically determined
standard deviations, $I_n$ is an $n \times n$ identity matrix, and
$\sigma^c = \epsilon \lambda$ where $\epsilon$ is a small scalar with
$\lambda = [\lambda_1, \ldots, \lambda_d]^T$. $\epsilon$ is expected to be small since we
expect the $c$ parameters to vary little throughout the walking cycle for each individual.

From a particular choice of $\{\mu_t, c_t\}$, the relative joint angles are
$\theta^l_t = A^*(\mu_t) = \hat{A}(\mu_t) + (Bc)(\mu_t)$, where $A^*(\mu_t)$ indicates the
interpolated value of each joint cycle, $A^*$, at phase $\mu_t$. The angular velocities,
$\dot{\theta}^l_t = A^*(\mu_t + 1) - A^*(\mu_t)$, are not estimated independently and the
velocities $\dot{\tau}^g_t$, $\dot{\theta}^g_t$ are propagated as in the smooth
motion case above. The Gaussian distribution over $\mu_t$ and $c_t$ implies a Gaussian
distribution over joint angles which defines the distribution
$p(\phi_t \mid \phi_{t-1}, V_{t-1})$ used in the Bayesian model.
Fig. 4. Tracking of one arm (2000 samples). Upper rows: frames 0, 10, 20, 30, 40 and
50 with the projection of the expected value of the model parameters overlaid.
Frame 0 corresponds to the manual initialization. Lower row: distributions of the
shoulder angles $\theta_x$, $\theta_y$ and $\theta_z$ as a function of frame number.
Brightness values denote the log posterior distribution in each frame.



7 Experiments 

We present examples of tracking people or their limbs in cluttered images. On
an Ultra 1 Sparcstation the C++ implementation takes approximately 5 minutes/frame
for experiments with 10,000 state samples. At frame 0, the posterior
distribution is derived from a hand-initialized 3D model. To visualize the posterior
distribution we display the projection of the 3D model corresponding to the
expected value of the model parameters: $\sum_{i=1}^{N} p_i \phi_i$, where $p_i$ is the
normalized likelihood of state sample $\phi_i$.



Arm Tracking. The smooth motion prior is used for tracking relatively low
dimensional models such as a single arm as illustrated in Figure 4. The model
has 8 parameters corresponding to the orientation and velocity of the 3 shoulder
angles and the elbow angle.

The twist of the upper arm, $\theta_z$, is ambiguous when the arm is straight since
the only information about the change in $\theta_z$ in that situation is the rotation
of the texture pattern on the upper arm. If the upper arm texture is of low
contrast (as in Figure 4) this will provide a very weak cue. This ambiguity is
easily represented in a particle filtering framework. In our case, $\theta_z$ is assigned
a uniform starting distribution. Some frames later (around frame 20), the arm
bends slightly, and the distribution over $\theta_z$ concentrates near the true value.
The rotation of a straight arm is an example of a kinematic singularity rrm .

Fig. 5. Tracking a human walking in a straight line (5000 samples, no rotation). Upper
rows: projection of the expected model configuration at frames 0, 10, 20, 30, 40 and 50.
Lower row: 3D configuration for the expected model parameters in the same frames.

Tracking Walking People. The walking model described in Section 6.2 is used
to track a person walking on a straight path parallel to the camera plane over 50
frames (Figure 5). The global rotation of the torso was held constant, lowering
the number of parameters to 9: the 5 eigencoefficients, $c$, phase, $\mu$, and global
3D position, $\tau^g$. All parameters were initialized manually with a Gaussian prior
at time $t = 0$ (Figure 5, frame 0). As shown in Figure 5, the model successfully
tracks the person although some parts of the body (often the arms) are poorly
estimated. This in part reflects the limited variation present in the training set.

The next experiment involves tracking a person walking in a circular path
and thus changing both depth and orientation with respect to the camera.
Figure 6 shows the tracking results for frames from 0 to 50. In frame 50 notice that
the model starts to drift off the person since the rotation is poorly estimated.
Such drift is common with optical flow-based tracking methods that rely solely
on the relative motion between frames. This argues for a more persistent
model of object appearance. Note that, while a constant appearance model (i.e.
a template) would not suffer the same sort of drift, it would be unable to cope
with changes in view, illumination, and depth. Note also that the training data
only contained examples of people walking in a straight line. While the circular
walking motion here differs significantly, the temporal model is sufficiently
general that it can approximate this new motion.

Fig. 6. Person walking in a circle (15000 samples). Upper rows: frames 0, 10, 20, 30,
40, 50 with the projection of the expected model configuration overlaid. Lower row:
expected 3D configuration in the same frames.

How significant is the temporal walking prior model? Figure 7 illustrates
the effect of repeating the above experiment with a uniform likelihood function,
so that the evolution of the parameters is determined entirely by the temporal
model. While the prior is useful for constraining the model parameters to valid
walking motions, it does not unduly affect the tracking.

8 Conclusion 

This paper has presented a Bayesian formulation for tracking of articulated hu- 
man figures in 3D using monocular image motion information. The approach 
employs a generative model of image appearance that extends the idea of para- 
meterized optical flow estimation to 3D articulated figures. Kinematic singulari- 
ties, depth ambiguities, occlusion, and ambiguous image information result in a 
multi-modal posterior probability distribution over model parameters. A particle 
filtering approach is used to represent and propagate the posterior distribution 
over time, thus tracking multiple hypotheses in parallel. To constrain the distri- 
bution to valid 3D human motions we define prior probability distributions over 
the dynamics of the human body. Such priors help compensate for missing or 
noisy visual information and enable stable tracking of occluded limbs. Results
were shown for a general smooth motion model as well as for an action-specific
walking model.

Fig. 7. How strong is the walking prior? Tracking results for frames 0, 10, 20, 30, 40
and 50, when no image information is taken into account.

A number of outstanding issues remain and are the focus of our research. 
The current model is initialized by hand and will eventually lose track of the 
object. Within a Bayesian framework we are developing a fully automatic system 
that samples from a mixture of initialization and temporal priors. We are also 
developing new temporal models of human motion that allow more variation than 
the eigencurve model yet are more constrained than the smooth motion prior. 
We are extending the likelihood model to better use information at multiple 
scales and to incorporate additional generative models for image features such 
as edges. Additionally, the likelihood computation is being extended to model the 
partial occlusion of limbs as in izq. Beyond this, one might replace the cylindrical 
limbs with tapered superquadrics |t)f 1 .5j and model the prior distribution over 
these additional shape parameters. Finally, we are exploring the representation 
of the posterior as a mixture of Gaussians 0. This provides a more compact 
representation of the distribution and interpolates between samples to provide 
a measure of the posterior in areas not covered by discrete samples. 
Abstract. The problem of detecting and labeling a moving human body
viewed monocularly in a cluttered scene is considered. The task is to 
decide whether or not one or more people are in the scene (detection), 
to count them, and to label their visible body parts (labeling). 

It is assumed that a motion-tracking front end is supplied: a number of 
moving features, some belonging to the body and some to the background 
are tracked for two frames and their position and velocity is supplied 
(Johansson display). It is not guaranteed that all the body parts are 
visible, nor that the only motion present is the one of the body. 

The algorithm is based on our previous work m we learn a probabi- 
listic model of the position and motion of body features, and calculate 
maximum-likelihood labels efficiently using dynamic programming on a 
triangulated approximation of the probabilistic model. We extend those 
results by allowing an arbitrary number of body parts to be undetec- 
ted (e.g. because of occlusion) and by allowing an arbitrary number of 
noise features to be present. We train and test on walking and dancing 
sequences for a total of approximately 10'* frames. The algorithm is de- 
monstrated to be accurate and efficient. 



1 Introduction 

Humans have developed a remarkable ability in perceiving the posture and mo- 
tion of the human body (‘biological motion’ in the human vision literature). 
Johansson 0 filmed people acting in total darkness with small light bulbs fixed 
to the main joints of their body. A single frame of a Johansson movie is nothing 
but a cloud of bright dots on a dark field; however, as soon as the movie is ani- 
mated one can readily detect, count, segment a number of people in a scene, and 
even assess their activity, age and sex. Although such perception is completely 
effortless, our visual system is ostensibly solving a hard combinatorial problem 
(which dot should be assigned to which body part of which person?). 

Perceiving the motion of the human body is difficult. First of all, the human 
body is richly articulated - even a simple stick model describing the pose of arms, 
legs, torso and head requires more than 20 degrees of freedom. The body moves 
in 3D which makes the estimation of these degrees of freedom a challenge in a 
monocular setting m. Image processing is also a challenge: humans typically
wear clothing which may be loose and textured. This makes it difficult to identify 
limb boundaries, and even more so to segment the main parts of the body. In a 
general setting all that can be extracted reliably from the images is patches of 
texture in motion. It is not so surprising after all that the human visual system 
has evolved to be so good at perceiving Johansson’s stimuli. 



Perception of biological motion may be divided into two phases: first detec- 
tion and, possibly, segmentation; then tracking. Of the two, tracking has recently 
been the object of much attention and considerable progress has been made imni
1411121141 ^ . Detection (given two frames: is there a human, where?), on the
contrary, remains an open problem. In m, we have focused on the Johansson
problem, proposing a method based on probabilistic modeling of human motion
and on modeling the dependency of the motion of body parts with a triangulated 
graph, which makes it possible to solve the combinatorial problem of labeling 
in polynomial time. Excellent and efficient performance of the method has been 
demonstrated on a number of motion sequences. However, that work is limited
to the case where there is no clutter (the only moving parts belong to the body,
as in Johansson’s displays). This is not a realistic situation: in typical scenes one
would expect the environment to be rich in motion patterns (cars driving by,
trees swinging in the wind, water rippling, ... as in Figure 1). Another limitation
is that only a limited amount of occlusion is allowed. This is again not realistic:
in typical situations little more than half of the body is visible, the other
half being self-occluded.




Fig. 1. Perception of biological motion in real scenes: one has to contend with a large
amount of clutter (more than one person in the scene, other objects in the scene are also
moving), and a large amount of self-occlusion (typically only half of the body is seen).
Observe that segmentation (arm vs. body, left and right leg) is at best problematic.

We propose here a modification of our previous scheme which addresses
both the problem of clutter and of large occlusion. We conduct experiments
to explore its performance vis-à-vis different types and levels of noise, variable
amounts of occlusion, and variable numbers of human bodies in the scene. Both
the detection performance and the labeling performance are assessed, as well as
the performance in counting the number of people in the scene.

In section 2 we first introduce the problem and some notation, then propose 
our approach. In section 3 we explain how to perform detection. In section 4 a 
simple method for aggregating information over a number of frames is discussed. 
In section 5 we explain how to count how many people there may be in the 
picture. Section 6 contains the experiments. 



2 Labeling 

In the Johansson scenario, each body part appears as a single dot in the image 
plane. Our problem can then be formulated as follows: given the positions and 
velocities of a number of point-features in the image plane (Figure 2(a)), we
want to find the configuration that is most likely to correspond to a human 
body. Detection is done based on how human-like the best configuration is. 




Fig. 2. Illustration of the problem. Given the position and velocity of point-features
in the image plane (a), we want to find the best possible human configuration: filled
dots in (b) are body parts and circles are background points. Arrows in (a) and (b)
show the velocities. (c) is the full configuration of the body. Filled (blackened) dots
represent the 'observed' points which appear in (b), and the '*'s are unobserved body
parts. 'L' and 'R' in label names indicate left and right. H:head, N:neck, S:shoulder,
E:elbow, W:wrist, H:hip, K:knee and A:ankle.

2.1 Notation 

Suppose that we observe $N$ points (as in Figure 2(a), where $N = 38$). We assign
an arbitrary index to each point. Let $S_{body} = \{LW, LE, LS, H, \ldots, RA\}$ be the
set of $M$ body parts; for example, $LW$ is the left wrist, $RA$ is the right ankle,
etc. Then:

$i \in 1, \ldots, N$    Point index    (1)

$X = [X_1, \ldots, X_N]$    Vector of measurements (position and velocity)    (2)

$L = [L_1, \ldots, L_N]$    Vector of labels    (3)

$L_i \in S_{body} \cup \{BG\}$    Set of possible values for each label    (4)
Notice that since there exist clutter points that do not belong to the body, the
background label $BG$ is added to the label set. Due to clutter and occlusion $N$ is
not necessarily equal to $M$ (which is the size of $S_{body}$). We want to find $L^*$,
over all possible label vectors $L$, such that the posterior probability of the labeling
given the observed data is maximized, that is,

$$L^* = \arg\max_{L \in \mathcal{L}} P(L \mid X) \qquad (5)$$

where $P(L \mid X)$ is the conditional probability of a labeling $L$ given the data $X$.
Using Bayes’ law:

$$P(L \mid X) = P(X \mid L) \frac{P(L)}{P(X)} \qquad (6)$$

If we assume that the priors $P(L)$ are equal for different labelings, then

$$L^* = \arg\max_{L \in \mathcal{L}} P(X \mid L) \qquad (7)$$

Given a labeling $L$, each point feature $i$ has a corresponding label $L_i$.
Therefore each measurement $X_i$ corresponding to body labels may also be written
as $X_{L_i}$, i.e. the measurements corresponding to a specific body part associated
with label $L_i$. For example, if $L_i = LW$, i.e. the label corresponding to the left
wrist is assigned to the $i$th point, then $X_i = X_{LW}$ is the position and velocity
of the left wrist.

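Let $\mathcal{L}_{body}$ denote the set of body parts appearing in $L$, $X_{body}$ be the
vector of measurements labeled as body parts, and $X_{bg}$ be the vector of measurements
labeled as background ($BG$). More formally,

$$\begin{aligned}
\mathcal{L}_{body} &= \{L_i, \ i = 1, \ldots, N\} \cap S_{body} \\
X_{body} &= [X_{i_1}, \ldots, X_{i_K}] \ \text{such that} \ \{L_{i_1}, \ldots, L_{i_K}\} = \mathcal{L}_{body} \\
X_{bg} &= [X_{j_1}, \ldots, X_{j_{N-K}}] \ \text{such that} \ L_{j_1} = \cdots = L_{j_{N-K}} = BG \qquad (8)
\end{aligned}$$

where $K$ is the number of body parts present in $L$.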

If we assume that the position and velocity of the visible body parts are
independent of the position and velocity of the clutter points, then

$$P(X \mid L) = P_{\mathcal{L}_{body}}(X_{body}) \cdot P_{bg}(X_{bg}) \qquad (9)$$

where $P_{\mathcal{L}_{body}}(X_{body})$ is the marginalized probability density function
of $P_{S_{body}}$ according to $\mathcal{L}_{body}$. If independent uniform background
noise is assumed, then $P_{bg}(X_{bg}) = (1/S)^{N-K}$, where $N - K$ is the number of
background points, and $S$ is the volume of the space $X_i$ lies in, which can be
obtained from the training set. In the following sections, we will address the issues
of estimating $P_{\mathcal{L}_{body}}(X_{body})$ and finding the $L^*$ with the highest
likelihood.
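The likelihood of a candidate labeling under the uniform background assumption can be sketched in Python as follows. The `Measurement` layout, the `log_p_body` callable standing in for the foreground density, and the assumption that each body label is used at most once are all illustrative choices of this sketch.

```python
import math
from typing import Callable, Dict, Sequence, Tuple

Measurement = Tuple[float, float, float, float]  # hypothetical (x, y, vx, vy)


def log_likelihood_of_labeling(
    X: Sequence[Measurement],
    L: Sequence[str],
    log_p_body: Callable[[Dict[str, Measurement]], float],
    S: float,
) -> float:
    """log P(X|L): foreground log density plus (N - K) * log(1 / S) for background."""
    # Measurements labeled as body parts, keyed by their label.
    body = {label: x for x, label in zip(X, L) if label != "BG"}
    # Number of points labeled as background, i.e. N - K.
    n_background = sum(1 for label in L if label == "BG")
    return log_p_body(body) - n_background * math.log(S)
```

Comparing this value across candidate label vectors gives the maximum-likelihood labeling; an exhaustive comparison is only feasible for very small $N$, which is why an efficient search is needed.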
2.2 Approximation of the Foreground Probability Density Function

If no body part is missing, we can use the method proposed in m to get the
approximation of the foreground probability density
$P_{\mathcal{L}_{body}}(X_{body})$. By using the kinematic chain structure of the human
body, the whole body can be decomposed as in Figure 3(a). If the appropriate
conditional independence (Markov property) is valid, then

$$\begin{aligned}
P_{\mathcal{L}_{body}}(X_{body}) &= P_{S_{body}}(X_{LW}, X_{LE}, X_{LS}, X_H, \ldots, X_{RA}) \\
&= P_{LW|LE,LS}(X_{LW} \mid X_{LE}, X_{LS}) \cdot P_{LE|LS,LH}(X_{LE} \mid \ldots) \cdots P_{RK,LA,RA}(X_{RK}, X_{LA}, X_{RA}) \\
&= \prod_{t=1}^{T-1} P_t(X_{A_t} \mid X_{B_t}, X_{C_t}) \cdot P_T(X_{A_T}, X_{B_T}, X_{C_T}) \qquad (10)
\end{aligned}$$

where $T$ is the number of triangles in the decomposed graph in Figure 3(a), $t$
is the triangle index, and $A_t$ is the first label associated with triangle $t$, etc.



Fig. 3. (a) One decomposition of the human body into triangles [12]. The label names
are the same as in Figure 0. The numbers inside triangles give the order in which
dynamic programming proceeds. (b) An illustrative example used in section 2.2.

If some body parts are missing, then the foreground probability density
function is the marginalized version of the above equation, i.e. marginalization over
the missing body parts. Marginalization should be performed so that it is a good
approximation of the true marginal probability density function and allows
efficient computation such as dynamic programming. We propose that doing the
marginalization term by term (triangle by triangle) of equation (10) and then
multiplying the terms together is a reasonable way to get such an approximation.
The idea can be illustrated by a simple example as in Figure 3(b). Considering
the joint probability density function of 5 random variables $P(A, B, C, D, E)$, if
these random variables are conditionally independent as described in the graph
of Figure 3(b), then



$$P(A, B, C, D, E) = P(A, B, C) \cdot P(D|B, C) \cdot P(E|C, D) \tag{11}$$

If $A$ is missing, then the marginalized PDF is $P(B, C, D, E)$. If the conditional
independence as in equation (11) can hold, then,

$$P(B, C, D, E) = P(B, C) \cdot P(D|B, C) \cdot P(E|C, D) \tag{12}$$

In the case of $D$ missing, the marginalized PDF is $P(A, B, C, E)$. If we assume
that $E$ is conditionally independent of $A$ and $B$ given $C$, which is a more
demanding conditional independence requirement with the absence of $D$ compared
to that of equation (11), then,

$$P(A, B, C, E) = P(A, B, C) \cdot 1 \cdot P(E|C) \tag{13}$$

Each term on the right hand sides of equations (12) and (13) is the
marginalized version of its corresponding term in equation (11). Similarly, if some
stronger conditional independence can hold, we can obtain an approximation of
$P_{\mathcal{L}_{body}}(X_{body})$ by performing the marginalization term by term of equation (10).
For example, considering triangle $(A_t, B_t, C_t)$, $1 \le t \le T-1$, if all of $A_t$, $B_t$ and
$C_t$ are present, then the $t$th term of equation (10) is
$P_{A_t|B_tC_t}(X_{A_t} | X_{B_t}, X_{C_t})$;
if $A_t$ is missing, the marginalized version of it is 1; if $A_t$ and $C_t$ are observed,
but $B_t$ is missing, it becomes $P_{A_t|C_t}(X_{A_t} | X_{C_t})$; if $A_t$ exists but both $B_t$ and
$C_t$ are missing, it is $P_{A_t}(X_{A_t})$. For the $T$th triangle, if some body part(s) are
missing, then the corresponding marginalized version of $P_T$ is used. The foreground
probability $P_{\mathcal{L}_{body}}(X_{body})$ can be approximated by the product of the above
(conditional) probability densities. Note that if too many body parts are
missing, the conditional independence assumptions of the graphical model will no
longer hold; it is reasonable to assume that the wrist is conditionally
independent of the rest of the body given the shoulder and elbow, but if both shoulder
and elbow are missing, this is no longer true. Nevertheless, we will use
independence as an approximation. All the above (conditional) probability densities
(e.g. $P_{LW|LE,LS}(X_{LW} | X_{LE}, X_{LS})$) can be estimated from the training data.
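
The difference between the exact case of equation (12) and the approximate case of equation (13) can be checked numerically. The following Python sketch is only an illustration: the binary variables, the random tables and all variable names are assumptions made here, not part of the original experiments. It builds a joint distribution that factorizes as in equation (11), then compares the exact marginals with the term-by-term products.

```python
import numpy as np

rng = np.random.default_rng(0)


def random_conditional(shape):
    """Random table normalised over its first axis, read as p(first | rest)."""
    table = rng.random(shape)
    return table / table.sum(axis=0, keepdims=True)


# Factors of equation (11): P(A,B,C) * P(D|B,C) * P(E|C,D), all variables binary.
p_abc = rng.random((2, 2, 2))
p_abc /= p_abc.sum()
p_d_given_bc = random_conditional((2, 2, 2))  # indexed [d, b, c]
p_e_given_cd = random_conditional((2, 2, 2))  # indexed [e, c, d]

# Full joint, indexed [a, b, c, d, e].
joint = np.einsum("abc,dbc,ecd->abcde", p_abc, p_d_given_bc, p_e_given_cd)

# A missing: exact marginal vs. the term-by-term product of equation (12).
exact_no_a = joint.sum(axis=0)
termwise_no_a = np.einsum(
    "bc,dbc,ecd->bcde", p_abc.sum(axis=0), p_d_given_bc, p_e_given_cd
)

# D missing: exact marginal vs. the approximation of equation (13).
exact_no_d = joint.sum(axis=3)
p_ce = joint.sum(axis=(0, 1, 3))                          # P(C, E), indexed [c, e]
p_e_given_c = (p_ce / p_ce.sum(axis=1, keepdims=True)).T  # P(E | C), indexed [e, c]
termwise_no_d = np.einsum("abc,ec->abce", p_abc, p_e_given_c)

gap_a = np.abs(exact_no_a - termwise_no_a).max()  # zero up to rounding
gap_d = np.abs(exact_no_d - termwise_no_d).max()  # generally nonzero
print(f"A missing: {gap_a:.2e}   D missing: {gap_d:.2e}")
```

With A missing the two quantities agree up to rounding, whereas with D missing a gap remains unless E really is conditionally independent of A and B given C, which is the stronger requirement mentioned in the text.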

2.3 Cost Functions and Comparison of Two Labelings 

The best labeling ($L^*$) can be found by comparing the likelihood of all the
possible labelings. To compare two labelings $L^1$ and $L^2$, if we can assume that
the priors $P(L^1)$ and $P(L^2)$ are equal, then by equation (9),

$$\frac{P(L^1|X)}{P(L^2|X)} = \frac{P(X|L^1)}{P(X|L^2)} = \frac{P_{\mathcal{L}^1_{body}}(X^1_{body}) \cdot P_{bg}(X^1_{bg})}{P_{\mathcal{L}^2_{body}}(X^2_{body}) \cdot P_{bg}(X^2_{bg})} = \frac{P_{\mathcal{L}^1_{body}}(X^1_{body}) \cdot (1/S)^{M-K_1}}{P_{\mathcal{L}^2_{body}}(X^2_{body}) \cdot (1/S)^{M-K_2}} \tag{14}$$

where $\mathcal{L}^1_{body}$ and $\mathcal{L}^2_{body}$ are the sets of observed body parts for $L^1$ and $L^2$
respectively, $K_1$ and $K_2$ are the sizes of $\mathcal{L}^1_{body}$ and $\mathcal{L}^2_{body}$, and $M$ is the total number
of body parts ($M = 14$ here). The last equality holds because
$P_{bg}(X^i_{bg}) = (1/S)^{N-K_i}$ and the common factor $(1/S)^{N-M}$ cancels in the ratio.
$P_{\mathcal{L}^i_{body}}(X^i_{body})$, $i = 1, 2$, can be approximated as in
section 2.2. From equation (14), the best labeling $L^*$ is the $L$ which can
maximize $P_{\mathcal{L}_{body}}(X_{body}) \cdot (1/S)^{M-K}$. This formulation makes both search by dynamic
programming and detection in different frames (possibly with different numbers
of candidate features $N$) easy, as will be explained below.
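
Since the quantity being maximized is a product of many small densities, it is convenient to evaluate it in the log domain. The sketch below is a minimal illustration under that assumption; the function names and the way the foreground log-density is supplied are hypothetical and not taken from the original implementation.

```python
import math


def labeling_log_score(log_p_foreground, num_observed, volume, num_parts=14):
    """Log of P_fg(X_body) * (1/S)^(M - K) for one candidate labeling.

    log_p_foreground: log of the (approximated) foreground density.
    num_observed:     K, number of body parts present in the labeling.
    volume:           S, volume of the space one measurement lies in.
    num_parts:        M, total number of body parts (14 in the text).
    """
    return log_p_foreground - (num_parts - num_observed) * math.log(volume)


def prefer_first(score_1, score_2):
    """True if labeling 1 is at least as likely as labeling 2 (equal priors)."""
    return score_1 >= score_2
```

Because the score does not involve the number of candidate features N, scores from frames with different amounts of clutter can be compared directly.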

The dynamic programming algorithm pni requires that the local cost function
associated with each triangle (as in Figure 3(a)) should be comparable for
different labelings, whether there are missing part(s) or not. Therefore we cannot
only use the terms of $P_{\mathcal{L}_{body}}(X_{body})$, because, for example, as we discussed in
the previous subsection, the $t$th term of $P_{\mathcal{L}_{body}}(X_{body})$ is $P_{A_t|B_tC_t}(X_{A_t} | X_{B_t}, X_{C_t})$
when all the three parts are present and it is 1 when $A_t$ is missing. It is unfair
to compare $P_{A_t|B_tC_t}(X_{A_t} | X_{B_t}, X_{C_t})$ with 1 directly. At this point, it is useful
to notice that in $P_{\mathcal{L}_{body}}(X_{body}) \cdot (1/S)^{M-K}$, for each unobserved (missing) body
part ($M - K$ in total), there is a $1/S$ term. $1/S$ ($S$ is the volume of the space
$X_{A_t}$ lies in) can be a reasonable local cost for the triangle with vertex $A_t$ (the
vertex to be deleted) missing because then for the same stage, the dimension of
the domain of the local cost function is the same. Also, $1/S$ can be thought of
as a threshold of $P_{A_t|B_tC_t}(X_{A_t} | X_{B_t}, X_{C_t})$, namely, if $P_{A_t|B_tC_t}(X_{A_t} | X_{B_t}, X_{C_t})$ is
smaller than $1/S$, then the hypothesis that $A_t$ is missing will win. Therefore, the
local cost function for the $t$th ($1 \le t \le T-1$) triangle can be approximated as
follows:

- if all the three body parts are observed, it is $P_{A_t|B_tC_t}(X_{A_t} | X_{B_t}, X_{C_t})$;

- if $A_t$ is missing or two or three of $A_t$, $B_t$, $C_t$ are missing, it is $1/S$;

- if either $B_t$ or $C_t$ is missing and the other two body parts are observed, then
it is $P_{A_t|C_t}(X_{A_t} | X_{C_t})$ or $P_{A_t|B_t}(X_{A_t} | X_{B_t})$.

The same idea can be applied to the last triangle T. These approximations are to 
be validated in experiments. Notice that when two body parts in a triangle are 
missing, only velocity information for the third body part is available since we use 
relative positions. The velocity of a point alone doesn’t have much information, 
so for two parts missing, we use the same cost function as the case of three body 
parts missing. 

With the local cost functions defined above, dynamic programming can be
used to find the labeling with the highest $P_{\mathcal{L}_{body}}(X_{body}) \cdot (1/S)^{M-K}$. The
computational complexity is on the order of $M \cdot N^3$.
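
The three cases of the local cost function can be written down as a small selection routine. This is a sketch only: the function name, the dictionary keys and the convention that the conditional densities have already been evaluated are assumptions made for the example.

```python
def triangle_local_cost(present, densities, volume):
    """Local cost of triangle t (1 <= t <= T-1), following the three cases above.

    present:   (a, b, c) booleans, True if A_t, B_t, C_t are observed.
    densities: already-evaluated density values under hypothetical keys
               "a|bc", "a|c" and "a|b".
    volume:    S, the volume of the space one measurement lies in.
    """
    a, b, c = present
    if a and b and c:
        return densities["a|bc"]   # all three parts observed
    if a and c and not b:
        return densities["a|c"]    # only B_t missing
    if a and b and not c:
        return densities["a|b"]    # only C_t missing
    # A_t missing, or two or three parts missing.
    return 1.0 / volume
```

For instance, with `volume=50.0` any triangle whose vertex A_t is missing contributes a cost of 0.02, so the hypothesis "A_t is present" wins only when its conditional density exceeds that value.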

3 Detection 

Given a hypothetical labeling $L$, the higher $P(X|L)$ is, the more likely it is that
the associated configuration of features represents a person. The labeling $L^*$
with the highest $P_{\mathcal{L}_{body}}(X_{body}) \cdot (1/S)^{M-K}$ provides us with the most human-like
configuration out of all the candidate labelings. Note that since the dimension
of the domain of $P_{\mathcal{L}_{body}}(X_{body}) \cdot (1/S)^{M-K}$ is fixed regardless of the number
of candidate features and the number of missing body parts in the labeling $L$,
we can directly compare the likelihoods of different hypotheses, even hypotheses
from different images.

In order to perform detection we first get the most likely labeling, then com- 
pare the likelihood of this labeling to a threshold. If the likelihood is higher 
than the threshold, then we will declare that a person is present. This threshold 
needs to be set based on experiments, to ensure the best trade-off between false 
acceptance and false rejection errors. 

4 Integrating Temporal Information | TfTij 

So far, we have only assumed that we may use information from two consecutive 
frames, from which we obtain position and velocity of a number of features. 
In this section we would like to extend our previous results to the case where 
multiple frames are available. However, in order to maintain generality we will 
assume that tracking across more than 2 frames is impossible. This is a simplified 
model of the situation where, due to extreme body motion or to loose and 
textured clothing, tracking is extremely unreliable and each individual feature’s 
lifetime is short. 

Let $P(O|X)$ denote the probability of the existence of a person given $X$. From
the equations above and the previous section, we use the approximation: $P(O|X)$ is
proportional to $P(X|L^*)$, defined as

$$P(X|L^*) = \max_{L} P_{\mathcal{L}_{body}}(X_{body}) \cdot (1/S)^{M-K},$$

where $L^*$ is the best labeling found from $X$. Now if we have $n$ observations
$X_1, X_2, \dots, X_n$, then the decision depends on:

$$P(O|X_1, X_2, \dots, X_n) = P(X_1, X_2, \dots, X_n|O) \cdot P(O) / P(X_1, X_2, \dots, X_n)$$

$$= P(X_1|O) P(X_2|O) \cdots P(X_n|O) \cdot P(O) / P(X_1, X_2, \dots, X_n) \tag{15}$$

The last line of equation (15) holds if we assume that $X_1, X_2, \dots, X_n$ are
independent. Assuming that the priors are equal, $P(O|X_1, X_2, \dots, X_n)$ can be
represented by $P(X_1|O) \cdots P(X_n|O)$, which is proportional to $\prod_{i=1}^{n} P(X_i|L_i^*)$.
If we set up a threshold for $\prod_{i=1}^{n} P(X_i|L_i^*)$, then we can do detection given
$X_1, X_2, \dots, X_n$.
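
Under the independence assumption, the multi-frame decision reduces to thresholding a product of per-frame likelihoods, which is a sum in the log domain. The sketch below assumes the per-frame best-labeling log-likelihoods are already available; the function name and argument layout are illustrative choices, not the original code.

```python
import math


def detect_over_frames(frame_log_likelihoods, log_threshold):
    """Multi-frame decision from per-frame best-labeling log-likelihoods.

    frame_log_likelihoods: log P(X_i | L_i*) for each observation i.
    log_threshold:         log of the threshold on the product.
    Returns (person_present, total_log_likelihood).
    """
    total = math.fsum(frame_log_likelihoods)  # log of the product over frames
    return total > log_threshold, total
```

The threshold on the product has to be chosen for a given number of observations n, since the product shrinks or grows as more frames are included.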

5 Counting 

Counting how many people are in the scene is also an important task since 
images often have multiple people in them. By the method described above, 
we can first get the best configuration to see if it could be a person. If so, all 
the points belonging to the person are removed and the next best labeling can 
then be found from the rest of points. We repeat until the likelihood of the best 
configuration is smaller than a threshold. Then the number of configurations 
with likelihood greater than the threshold is the number of people in the scene. 
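
The counting procedure is a greedy loop around the single-person detector. A minimal sketch follows, assuming a caller-supplied function `best_labeling(points)` that returns the likelihood of the best configuration together with the indices of the points it uses. That interface and the toy detector used in the example are hypothetical.

```python
def count_people(points, best_labeling, threshold):
    """Greedy counting: detect the best configuration, remove its points, repeat.

    best_labeling(points) -> (likelihood, indices) is supplied by the caller.
    """
    remaining = list(points)
    count = 0
    while remaining:
        likelihood, used = best_labeling(remaining)
        if likelihood <= threshold or not used:
            break  # best remaining configuration is not person-like enough
        used = set(used)
        remaining = [p for i, p in enumerate(remaining) if i not in used]
        count += 1
    return count


def toy_best_labeling(points):
    """Toy stand-in: any three points tagged "body" form a perfect 'person'."""
    indices = [i for i, p in enumerate(points) if p == "body"][:3]
    return (1.0 if len(indices) == 3 else 0.0), indices


example_count = count_people(["body"] * 7 + ["bg"] * 5, toy_best_labeling, 0.5)
```

In the toy example seven "body" points yield two complete triples, so two people are counted and the leftover point is rejected.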
6 Experiments

In this section we explore experimentally the performance of our system. The 
data were obtained from a 60 Hz motion capture system. The motion capture 
system can provide us with labeling for each frame which can be used as ground 
truth. In our experiments, we assumed that both position and velocity were 
available for each candidate point. The velocity was obtained by subtracting the 
positions in two consecutive frames. 

Two different types of motions were used in our experiments, walking and
dancing. Figure 4 shows sample frames of these two motions.








Fig. 4. Sample frames. (a) A walking sequence; (b) a dancing sequence. Eight filled
dots denote the eight observed body parts; the open circles mark points that are
actually missing (not available to the program). The numbers along the horizontal axes
indicate the frame numbers.



6.1 Training of the Probabilistic Models 

The probabilistic models were trained separately for walking and dancing, and 
in each experiment the appropriate model was used. For the walking action, two 
sequences of 7000 frames were available. The first sequence was used for training, 
and the second sequence for testing. For the dancing action, one sequence of 5000 
frames was available; the first half was used for training, and the second half for 
testing. 

The training was done by estimating the joint (or conditional) probability
density functions (pdfs) for all the triplets as described in section 2. For each
triplet, position information was expressed within a local coordinate frame, i.e.
relative positions, and velocities were absolute ones. As in [12], we assumed that
all the pdfs were Gaussian, and the parameters for the Gaussian distribution were
estimated from the training set.
6.2 Detection

In this experiment, we test how well our method can distinguish whether or not 
a person is present in the scene (Figure El . We present the algorithm with two 
types of inputs (presented randomly in equal proportions); in one case only clut- 
ter (background) points are present, in the other body parts from the walking 
sequence are superimposed on the clutter points. We call ’detection rate’ the 
fraction of frames containing a body that is recognized correctly. We call ’false 
alarm rate’ the fraction of frames containing only clutter where our system de- 
tects a body. 

We want to test the detection performance when only part of the whole body 
(with 14 body parts in total) can be seen. We generated the signal points (body 
parts) in the following way: for a fixed number of signal points, we randomly 
selected which body parts would be used in each frame (actually pair of frames, 
since consecutive frames were used to estimate the velocity of each body part). 
So in principle, each body part has an equal chance to be represented, and as far 
as the decomposed body graph is concerned, all kinds of graph structures (with 
different body parts missing) can be tested. 

The positions and velocities of clutter (background) points were indepen- 
dently generated from uniform probability densities. For positions, we used the 
leftmost and rightmost positions of the whole training sequence as the horizon- 
tal range, and highest and lowest body part positions as the vertical range. For 
velocities, the possible range was inside a circle in velocity space (horizontal and 
vertical velocities) with radius equal to the maximum magnitude of body part 
velocities in the training sequences. FigureEI(a) shows a frame with 8 body parts 
and 30 added background points with arrows representing velocities. 
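
The clutter model described above can be sketched in a few lines. The function name, the numeric ranges and the use of Python's `random` module are assumptions for illustration; in the experiments the ranges were taken from the training sequences.

```python
import math
import random


def make_clutter(num_points, x_range, y_range, max_speed, rng):
    """Return a list of (x, y, vx, vy) background points.

    Positions are uniform in the rectangle x_range x y_range; velocities are
    uniform inside a disc of radius max_speed in velocity space.
    """
    points = []
    for _ in range(num_points):
        x = rng.uniform(*x_range)
        y = rng.uniform(*y_range)
        # The square root of a uniform variate gives a radius that is
        # uniformly distributed over the area of the disc.
        speed = max_speed * math.sqrt(rng.random())
        angle = rng.uniform(0.0, 2.0 * math.pi)
        points.append((x, y, speed * math.cos(angle), speed * math.sin(angle)))
    return points


# Hypothetical ranges, for illustration only.
clutter = make_clutter(30, (0.0, 320.0), (0.0, 240.0), 5.0, random.Random(0))
```

Sampling the speed directly from a uniform distribution would concentrate points near zero velocity; the square root keeps the density uniform over the disc.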

The six solid curves of Figure 5(a) are the receiver operating characteristics
(ROCs) obtained from our algorithm when the 'positive' test images contained
3 to 8 signal points with 30 added background points and the 'negative' test
images contained 30 background points. The more signal points, the better the
ROC. With 30 background points, when the number of signal points is more
than 8, the ROCs are almost perfect.

When using the detector in a practical situation, some detection threshold
needs to be set; if the likelihood of the best labeling exceeds the threshold, a
person is deemed to be present. Since the number of body parts is unknown
beforehand, we need to fix a threshold that is independent of (and robust with
respect to) the number of body parts present in the scene. The dashed line in
Figure 5(a) shows the overall ROC of all the frames used for the six ROC curves
in solid lines. We took the point where $P_{detect} = 1 - P_{false-alarm}$ on it as our
threshold. The star ('*') point on each solid curve shows the point corresponding
to that threshold. Figure 5(b) shows the relation between detection rate and
number of body parts displayed with regard to the fixed threshold. The false
alarm rate is 12.97%.
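
Choosing the operating point where the detection rate equals one minus the false alarm rate can be done by scanning candidate thresholds over the pooled scores. The sketch below is one simple way to do this; the function name and the exhaustive scan are assumptions, and the original threshold was read off the overall ROC curve.

```python
def equal_error_threshold(positive_scores, negative_scores):
    """Pick the threshold where detection rate is closest to 1 - false alarm rate.

    positive_scores: best-labeling likelihoods for frames that contain a person.
    negative_scores: best-labeling likelihoods for clutter-only frames.
    Returns (threshold, detection_rate, false_alarm_rate).
    """
    best = None
    for threshold in sorted(set(positive_scores) | set(negative_scores)):
        p_detect = sum(s > threshold for s in positive_scores) / len(positive_scores)
        p_false = sum(s > threshold for s in negative_scores) / len(negative_scores)
        gap = abs(p_detect - (1.0 - p_false))
        if best is None or gap < best[0]:
            best = (gap, threshold, p_detect, p_false)
    return best[1:]
```

For example, `equal_error_threshold([2, 3, 4, 5], [0, 1, 2.5, 1.5])` returns `(2, 0.75, 0.25)`: at that threshold three of four positives are detected and one of four negatives raises a false alarm.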

When the algorithm can correctly detect whether there is a person, it doesn’t
necessarily mean that all the body parts are correctly labeled. Therefore we also
studied the correct label rate when a person is correctly detected. Figure 5(c)
shows the result. While the detection rate is constant (with no errors) with 8
or more body parts visible, the correct label rate increases with the number of
body parts. The correct label rates here are smaller than in [12] since we have
fewer signal points but many more background points.







Fig. 5. (a) to (e) are detection results on 2 frames only, and (f) shows the result
of using multiple frames. (a) ROC curves. Solid lines: 3 to 8 out of 14 body parts
with 30 background points vs. 30 background points only. The more signal points, the
better the ROC. Dashed line: overall ROC considering all the frames. The threshold
corresponding to $P_d = 1 - P_{fa}$ on it was used for later experiments. The stars ('*')
on the solid curves are the points corresponding to that threshold. (b) Detection rate
vs. number of body parts displayed with regard to the fixed threshold as in (a). The
false alarm rate is 12.97%. (c) The solid line is correct label rate vs. number of body
parts when a person is correctly detected. The chance level is shown in dashed line. (d)
The detection rate vs. standard deviation (in pixels) when Gaussian noise was added
to positions, using displays composed of 8 signal points and 30 background points in
each frame. The standard deviation of the velocity error was one tenth of that of the
position error. The detection threshold is the same as (b) and (c), with the false alarm
rate 12.97%. (e) Results for biological clutter (background points were obtained from
the walking sequences): detection rate vs. number of signal points. Solid line (with
stars): with 30 added background points, false alarm rate is 24.19%; dashed line (with
triangles): with 20 added background points, false alarm rate is 19.45%. (f) Detection
rate (when $P_{detect} = 1 - P_{false-alarm}$) vs. number of frames used with only 5 body
parts present.



The data used above were acquired by an accurate motion capture system 
where markers were used to identify important features. In image sequences 
where people do not wear markers, candidate features can be obtained from a 
motion detector/feature tracker ( [I 1 1 1 3j 1. where extra measurement noise may 
be introduced. To test the performance of our method under that situation,
independent Gaussian noise was added to the position and velocity of the signal
points (body parts). We experimented with displays composed of 8 signal points 
and 30 background points in each frame. Figure 5(d) shows the detection rate
(with regard to the same threshold as Figure 5(b) and (c)) vs. standard deviation
(in pixels) of added Gaussian noise to positions. The standard deviation of noise 
added to velocities is one tenth of that of positions, which reflects the fact that 
the position error, due to the inaccurate localization of a feature by a tracking 
algorithm ( nmsi), is usually much larger than the velocity error which is due 
to the tracking error from one frame to the next. 

We also tested our method by using biological clutter, that is, the background
points were generated by independently drawing points (with position and
velocity) of randomly chosen frames and body parts from the walking sequence.
Figure 5(e) shows the results.

6.3 Using Temporal Information 

The detection rate improves by integrating information over time as discussed
in section 4. We tested this using displays composed of 5 signal points and 30
background points (the 5 body parts present in each frame were chosen randomly
and independently). The results are shown in Figure 5(f).

6.4 Counting 

We call ’counting’ the task of finding how many people are present in a scene. Our 
stimuli with multiple persons were obtained in the following way. A person was 
generated by randomly choosing a frame from the sequence, and several frames 
(persons) can be superimposed together in one image with the position of each 
person selected randomly but not overlapped with each other. The statistics of 
background features was similar to that in section 6.2 (Figure EKa)), but with
the positions distributed on a window three times as wide as that in Figure El
(a). Figure 6(a) gives an example of images used in this experiment, with three
persons (six body parts each) and sixty background points. 

Our stimuli contained from zero to three persons. The threshold from Figure
5(a) was used for detection. If the probability of the configuration found was
above the threshold, then it was counted as a person. The curves in Figure 6(b)
show the correct count rate vs. the number of signal points. To compare the
results conveniently, we used the same number of body parts for different persons
in one image (but the body parts present were randomly chosen). The solid line
represents counting performance when one person was present in each image,
the dashed line with circles is for stimuli containing two persons, and the
dash-dot line with triangles is for three persons. If there was no person in the image,
the correct rate was 95%. From Figure 6(b), we see that the result for displays
containing fewer people is better than that with more people, especially when
the number of observed body parts is small. We can explain it as follows. If the
probability of counting one person correctly is $P$, then the probability of counting
$n$ people correctly is $P^n$ if the detection of different people is independent. For
Fig. 6. (a) One sample image of counting experiments. '*'s denote body parts and
'o's are background points. There are three persons (six body parts for each person)
with sixty superimposed background points. Arrows are the velocities. (b) Results of
counting experiments: correct rate vs. number of body parts. Solid line (with solid
dots): one person; dashed line (with open circles): two persons; dash-dot line (with
triangles): three persons. Detection of a person is with regard to the threshold chosen
from Figure 5(a). For that threshold the correct rate for recognizing that there is no
person in the scene is 95%.



example, in the case of four body parts, if the correct rate for one person is 0.6,
then the correct rate for counting three persons is $0.6^3 = 0.216$. This is just an
approximation since body parts from different persons may be very close and the
body part of one person may be perceived as belonging to another. Furthermore,
the assumption of independence is also violated since once a person is detected
the corresponding body parts are removed from the scene in order to detect
subsequent people.



6.5 Experiments on Dancing Sequence 

In the previous experiments, walking sequences were used as our data. In this
section, we tested our model on a dancing sequence. Results are shown in Figure
7. The signal points (body parts) were from the dancing sequence and the clutter
points were generated the same way as in section 6.2 (Figure 0(a)).

7 Conclusions 

We have presented a method for detecting, labeling and counting biological
motion in a Johansson-like sequence. We generalize our previous work [12] by
extending the technique to work on arbitrary amounts of clutter and occlusion.

We have tested our implementation on two kinds of moving sequences
(walking and dancing) and demonstrated that it performs well under conditions of
clutter and occlusion that are possibly more challenging than one would expect
in a typical real-life scenario: the motion clutter we injected in our displays
was designed to resemble the motion of individual body parts, the number of
noise points in our experiments far exceeded the number of signal points, and the
number of undetected/occluded signal features in some experiments exceeded
the number of detected features. Just to quote one significant performance
figure: 2-frame detection rate is better than 90% when 6 out of 14 body parts
Fig. 7. Results of dancing sequences. (a) Solid lines: ROC curves for 4 to 10 body
parts with 30 background points vs. 30 background points only. The more signal points,
the better the ROC. Dashed line: overall ROC considering all the frames used in seven
solid ROCs. The threshold corresponding to $P_d = 1 - P_{fa}$ on this curve was used for
(b). The stars ('*') on the solid curves are the points corresponding to that threshold.
(b) Detection rate vs. the number of body parts displayed with regard to the fixed
threshold. The false alarm rate is 14.67%. Comparing with the results in Figure 5(a,
b), we can see that more body parts must be observed during the dancing sequence to
achieve the same detection rate as with the walking sequences, which is expected since
the motion of dancing sequences is more active and harder to model. Nevertheless, the
ROC curve with 10 out of 14 body parts present is nearly perfect.



are seen within 30 clutter points (see Figure 5(b)). When the number of
frames considered exceeds 5 then performance quickly reaches 100% correct (see
Figure 5(f)). This means that even in high-noise conditions detection is flawless
in 100 ms or so (considering a 60 Hz imaging system), a figure comparable to the
alleged performance of the human visual system P). Moreover, our algorithm is 
computationally efficient, taking order of 1 second in our Matlab implementation 
on a regular Pentium computer, which gives significant hope for a real-time C 
implementation on the same computer. 

The next step in our work is clearly the application of our system to real 
image sequences, rather than Johansson displays. We anticipate using a simple 
feature/patch detector and tracker in order to provide the position- velocity mea- 
surements that are input in our system. Since our system can work with features 
that have a short life-span (in the limit 2-frame) this should be feasible without 
modifying the overall approach. A first set of experiments is described in mi.
Comparing in detail the performance of our algorithm with the human visual
system is another avenue that we intend to pursue.
Abstract. The causal estimation of three-dimensional motion from a
sequence of two-dimensional images can be posed as a nonlinear filtering 
problem. We describe the implementation of an algorithm whose uni- 
form observability, minimal realization and stability have been proven 
analytically in [5]. We discuss a scheme for handling occlusions, drift in 
the scale factor and tuning of the filter. We also present an extension 
to partially calibrated camera models and prove its observability. We 
report the performance of our implementation on a few long sequences 
of real images. More importantly, however, we have made our real-time 
implementation - which runs on a personal computer - available to the 
public for first-hand testing. 



1 Introduction 

Inferring the three-dimensional (3-D) shape of a moving scene from its two- 
dimensional images is one of the classical problems of computer vision, known 
by the name of “shape from motion” (SFM). Among all possible ways in which 
this can be done, we distinguish between causal schemes and non-causal ones.
More than the fact that causal schemes use - at any given point in time - only 
information from the past, the main difference between these two approaches lies 
in their goals and in the way in which data are collected. When the estimates 
of motion are to be used in real time, for instance to accomplish a control task, 
a causal scheme must be employed since “future” data are not available for 
processing and the control action must be taken “now” . In that case, the sequence 
of images is often collected sequentially in time, while motion changes smoothly 
under the auspices of inertia, gravity and other physical constraints. When, on 
the other hand, we collect a number of “snapshots” of a scene from disparate 
viewpoints and we are interested in reconstructing it, there is no natural ordering 
or smoothness involved; using a causal scheme in this case would be, in the end, 
highly unwise. 

No matter how the data are collected, however, SFM is subject to fundamental
tradeoffs, which we articulate in section 1.2. This paper aims at addressing

* Supported by NSF IIS-9876145 and ARO DAAD19-99-1-0139. We wish to thank
Xiaolin Feng, Carlo Tomasi, Pietro Perona, Ruggero Frezza, John Oliensis and Philip 
McLauchlan for discussions. 
such tradeoffs: it is possible to integrate visual information over time, hence
achieving a global estimate of 3-D motion, while maintaining the correspondence
problem local. Among the obstacles we encounter is the fact that individual
points tend to become occluded during motion, while novel points become
visible. In [5] we have introduced a wide-sense approximation to the optimal
filter and proved that it is observable, minimal and stable. In this paper we
describe a complete, real-time implementation of the algorithm, which includes an
approach to handle occlusions causally.



1.1 A first Formalization of the Problem 



Consider an $N$-tuple of points in the three-dimensional Euclidean space,
represented as a matrix

$$X = [X^1\ X^2\ \dots\ X^N] \tag{1}$$

and let them move under the action of a rigid motion represented by a translation
vector $T$ and a rotation matrix $R$. Rotation matrices are orthogonal with unit
determinant: $\{R \mid R^T R = R R^T = I\}$. Rigid motions transform the coordinates
of each point via $R(t)X^i + T(t)$. Associated to each motion $\{T, R\}$ there is
a velocity, represented by a vector of linear velocity $V$ and a skew-symmetric
matrix $\hat{\omega}$ of rotational velocity. Skew-symmetric $3 \times 3$ matrices are represented
using the “hat” notation

$$\hat{\omega} = \begin{bmatrix} 0 & -\omega_3 & \omega_2 \\ \omega_3 & 0 & -\omega_1 \\ -\omega_2 & \omega_1 & 0 \end{bmatrix} \tag{2}$$



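Under such velocity, motion evolves according to

$$\begin{cases} T(t+1) = e^{\hat{\omega}(t)} T(t) + V(t) \\ R(t+1) = e^{\hat{\omega}(t)} R(t) \end{cases} \tag{3}$$

The exponential of a skew-symmetric matrix can be computed conveniently
using Rodrigues’ formula:

$$e^{\hat{\omega}} = I + \frac{\hat{\omega}}{\|\omega\|} \sin(\|\omega\|) + \frac{\hat{\omega}^2}{\|\omega\|^2} \left(1 - \cos(\|\omega\|)\right) \tag{4}$$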
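
As a concrete illustration, the hat operator and Rodrigues' formula can be coded in a few lines of Python with NumPy. The function names `hat` and `exp_so3` and the small-angle cutoff are choices made for this sketch only.

```python
import numpy as np


def hat(w):
    """Skew-symmetric matrix associated with the 3-vector w."""
    w1, w2, w3 = w
    return np.array([[0.0, -w3, w2],
                     [w3, 0.0, -w1],
                     [-w2, w1, 0.0]])


def exp_so3(w):
    """Matrix exponential of hat(w) via Rodrigues' formula."""
    w = np.asarray(w, dtype=float)
    theta = np.linalg.norm(w)
    W = hat(w)
    if theta < 1e-12:
        # Near zero rotation, fall back to the first-order approximation.
        return np.eye(3) + W
    return (np.eye(3)
            + (np.sin(theta) / theta) * W
            + ((1.0 - np.cos(theta)) / theta**2) * (W @ W))


# A quarter turn about the third axis maps the first axis onto the second.
R_quarter = exp_so3([0.0, 0.0, np.pi / 2.0])
```

The result is a rotation matrix: it is orthogonal and has unit determinant, as required of $R$ in the text.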

We assume that - to an extent discussed in later sections - the correspondence
problem is solved, that is we know which point corresponds to which in different
projections (views). Equivalently, we assume that we can measure the (noisy)
projection

$$y^i(t) = \pi\left(R(t)X^i + T(t)\right) + n^i(t) \in \mathbb{R}^2 \quad \forall\ i = 1 \dots N \tag{5}$$

where we know the correspondence $y^i \leftrightarrow X^i$. We take as projection model an
ideal pinhole, so that $y = \pi(X) = \left[\frac{X_1}{X_3}\ \ \frac{X_2}{X_3}\right]^T$. This choice is not crucial and
the discussion can be easily extended to other projection models (e.g. spherical,
orthographic, para-perspective, etc.). We do not distinguish between $y$ and its
projective coordinate (with a 1 appended), so that we can write $X = yX_3$.
Finally, by organizing the time-evolution of the configuration of points and their
motion, we end up with a discrete-time, non-linear dynamical system:

$$\begin{cases}
X(t+1) = X(t) & X(0) = X_0 \\
T(t+1) = e^{\hat{\omega}(t)} T(t) + V(t) & T(0) = T_0 \\
R(t+1) = e^{\hat{\omega}(t)} R(t) & R(0) = R_0 \\
V(t+1) = V(t) + \alpha_V(t) & V(0) = V_0 \\
\omega(t+1) = \omega(t) + \alpha_\omega(t) & \omega(0) = \omega_0 \\
y^i(t) = \pi\left(R(t)X^i(t) + T(t)\right) + n^i(t) & n^i(t) \sim \mathcal{N}(0, \Sigma_n)
\end{cases} \tag{6}$$

where $v \sim \mathcal{N}(M, S)$ indicates that a vector $v$ is distributed normally with mean
$M$ and covariance $S$. In the above system, $\alpha$ is the relative acceleration between
the viewer and the scene. If some prior modeling information is available (for
instance when the camera is mounted on a vehicle or on a robot arm), this is the
place to use it. Otherwise a statistical model can be employed. In particular, we
can formalize our ignorance on acceleration by modeling $\alpha$ as a Brownian motion
process.¹ In principle one would like - at least for this simplified formalization
of SFM - to find the optimal solution. Unfortunately, as we explain in [5], there
exists no finite-dimensional optimal filter for this model. Therefore, at least for
this elementary instantiation of SFM, we would like to derive approximations
that are provably stable and efficient.
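
As a rough illustration of the dynamical system above, the following Python sketch propagates the state by one step and produces pinhole measurements. It is not the filter described in this paper: the function names, the tuple layout of the state and the scalar `noise_std` (an isotropic stand-in for the measurement covariance) are assumptions made for the example.

```python
import numpy as np


def hat(w):
    """Skew-symmetric matrix associated with the 3-vector w."""
    w1, w2, w3 = w
    return np.array([[0.0, -w3, w2], [w3, 0.0, -w1], [-w2, w1, 0.0]])


def exp_so3(w):
    """Rodrigues' formula, with a first-order fallback near zero rotation."""
    theta = np.linalg.norm(w)
    W = hat(w)
    if theta < 1e-12:
        return np.eye(3) + W
    return (np.eye(3) + (np.sin(theta) / theta) * W
            + ((1.0 - np.cos(theta)) / theta**2) * (W @ W))


def project(points):
    """Ideal pinhole projection of a 3 x N array of points to 2 x N."""
    return points[:2] / points[2]


def step(state, accel_v, accel_w):
    """One step of the state equations; state = (X, T, R, V, w)."""
    X, T, R, V, w = state
    E = exp_so3(w)
    return (X, E @ T + V, E @ R, V + accel_v, w + accel_w)


def measure(state, noise_std, rng):
    """Noisy image-plane measurements of all points for the given state."""
    X, T, R, _, _ = state
    y = project(R @ X + T[:, None])
    return y + noise_std * rng.standard_normal(y.shape)


rng = np.random.default_rng(0)
X0 = np.array([[0.0, 1.0], [0.0, 0.0], [2.0, 4.0]])  # two points, 3 x N
state = (X0, np.zeros(3), np.eye(3), np.array([0.0, 0.0, 1.0]), np.zeros(3))
state = step(state, np.zeros(3), np.zeros(3))
y = measure(state, 0.0, rng)  # noise-free image coordinates, 2 x N
```

In the example the scene translates by one unit along the optical axis, so the two points end up at depths 3 and 5 and the second point projects to 1/5 in the first image coordinate.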



1.2 Tradeoffs in Structure from Motion 

The first tradeoff involves the magnitude of the baseline and the correspondence 
problem, and has been discussed extensively in j^. When images are taken from 
disparate viewpoints, estimating relative orientation is simple, given the cor- 
respondence. However, solving the correspondence problem is difficult, for it 
amounts to a global matching problem - all too often solved by hand - which 
spoils the possibility of use in real-time control systems. When images are collec- 
ted closely in time, on the other hand, correspondence becomes an easy-to-solve 
local variational problem. However, estimating 3-D motion becomes rather dif- 
ficult since - on small motions - the noise in the image overwhelms the feeble 
information contained in the 2-D motion of the features. 

No matter how one chooses to increase the baseline in order to bypass the 
tradeoff with correspondence, one inevitably runs into deeper problems, namely 
the fact that individual feature points can appear and disappear due to occlusions,
or to changes in their appearance due to specularities, light distribution etc. To 
increase the baseline, it is necessary to associate the scale factor to an invariant 
of the scene. Therefore, in order to process that information, the scale factor 
must be included in the model. This tradeoff is fundamental and there is no 
easy way around it: information on shape can only be integrated as long as the 
shape is visible. 



^ We wish to emphasize that this choice is not crucial towards the conclusions reached 
in this paper. Any other model would do, as long as the overall system is observable. 

1.3 Relation to Previous Work and Organization of the Paper 

We are interested in estimating motion so that we can use the estimates to 
accomplish spatial control tasks such as moving, tracking, manipulation etc. 
In order to do so, the estimates must be provided in real time and causally, 
while we can rely on the fact that images are taken at adjacent instants in time 
and the relative motion between the scene and the viewer is somewhat smooth 
(rather than having isolated “snapshots”). Therefore, we do not compare our 
algorithms with batch multi-frame approaches to SFM. This includes iterative 
minimization techniques such as “bundle adjustment”. If one can afford the 
time for processing sequences of images off-line, of course a batch approach that 
optimizes simultaneously on all frames will perform better. 

Our work falls within the category of causal motion and structure estimation, 
which has a long and rich history. The first attempts to prove stability of the 
schemes proposed are recent. The first attempts to handle occlusions in a causal 
scheme came only a few years ago. Our approach is similar in spirit to the work of 
Azarbayejani and Pentland, extended to handle occlusions and to give correct 
weighting to the measurements. 

The first part of this study [5] contains a proof of uniform observability and 
stability of the algorithm that we describe here. In passing, we show how the 
conditions we impose on our models are tight: imposing either more or less results 
in either a biased or an unstable filter. The second part, reported in this paper, 
is concerned with the implementation of a system working in real time on real 
scenes, which we have made available to the public. 

2 Realization 



In order to design a finite-dimensional approximation to the optimal filter, we 
need an observable realization of the original model. In [5] we have proven the 
following claim. 

Corollary 1 The model 

\[
\begin{cases}
y_0^i(t+1) = y_0^i(t), \quad i = 4 \dots N & y_0^i(0) = y_0^i \\
\rho^i(t+1) = \rho^i(t), \quad i = 2 \dots N & \rho^i(0) = \rho_0^i \\
T(t+1) = \exp(\widehat{\omega}(t))\,T(t) + V(t) & T(0) = T_0 \\
\Omega(t+1) = \mathrm{Log}_{SO(3)}\big(\exp(\widehat{\omega}(t))\exp(\widehat{\Omega}(t))\big) & \Omega(0) = \Omega_0 \\
V(t+1) = V(t) + \alpha_V(t) & V(0) = V_0 \\
\omega(t+1) = \omega(t) + \alpha_\omega(t) & \omega(0) = \omega_0 \\
y^i(t) = \pi\big(\exp(\widehat{\Omega}(t))\,y_0^i(t)\rho^i(t) + T(t)\big) + n^i(t), \quad i = 1 \dots N
\end{cases}
\tag{7}
\]

^ One may argue that batch approaches are now fast enough that they can be used for 
real-time processing. Our take on this issue, argued elsewhere, is that 
speed is not the problem; robustness and delays are. 

^ There are several ways of handling missing data in a batch approach: since they do 
not extend to causal processing, we do not review them here. 

Observability in SFM was first addressed in 1994 (see also [23] for a 
more complete account of these results). Observability is closely related to “gauge 
invariance”. 

is a minimal realization of the original model. The notation $\mathrm{Log}_{SO(3)}(R)$ stands 
for $\Omega$ such that $R = e^{\widehat{\Omega}}$, and is computed by inverting Rodrigues' formula. 
$\Omega$ is called the “canonical representation” of $R$. 
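
The exponential and logarithm maps on SO(3) used in the model above can be sketched as follows. This is a minimal pure-Python illustration and not the Matlab routine from the software distribution; the function names `hat`, `matmul`, `exp_so3` and `log_so3` are hypothetical, and the logarithm as written is only valid for rotation angles strictly below pi.

```python
import math

def hat(w):
    """Skew-symmetric matrix associated with a 3-vector (hypothetical helper)."""
    x, y, z = w
    return [[0.0, -z, y], [z, 0.0, -x], [-y, x, 0.0]]

def matmul(A, B):
    """Product of two 3x3 matrices stored as nested lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def exp_so3(w):
    """Rodrigues' formula: the rotation matrix exp(hat(w))."""
    theta = math.sqrt(sum(c * c for c in w))
    K = hat(w)
    K2 = matmul(K, K)
    if theta < 1e-8:
        a, b = 1.0, 0.5  # limits of sin(t)/t and (1 - cos t)/t^2 as t -> 0
    else:
        a = math.sin(theta) / theta
        b = (1.0 - math.cos(theta)) / theta ** 2
    return [[(1.0 if i == j else 0.0) + a * K[i][j] + b * K2[i][j]
             for j in range(3)] for i in range(3)]

def log_so3(R):
    """Inverse of Rodrigues' formula (assumes rotation angle below pi)."""
    c = max(-1.0, min(1.0, (R[0][0] + R[1][1] + R[2][2] - 1.0) / 2.0))
    theta = math.acos(c)
    # R - R^T = 2 sin(theta) hat(n), so the axis is read off the skew part.
    v = [R[2][1] - R[1][2], R[0][2] - R[2][0], R[1][0] - R[0][1]]
    s = 0.5 if theta < 1e-8 else theta / (2.0 * math.sin(theta))
    return [s * x for x in v]
```

With these helpers, one step of the rotation state equation reads `log_so3(matmul(exp_so3(omega), exp_so3(Omega)))`.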



Remark 1 Notice that in the above claim the index for $y_0^i$ starts at 4, while the 
index for $\rho^i$ starts at 2. This corresponds to choosing the first three points as 
reference for the similarity group and is necessary (and sufficient) for guaran- 
teeing that the representation is minimal. As explained in [5], this can be done 
without loss of generality, i.e. modulo a reordering of the states. 



2.1 Partial Autocalibration 

As we have anticipated, the models proposed can be extended to account for 
changes in calibration. For instance, if we consider an imaging model with focal 
length $f$ 

\[
\pi_f(X) = \frac{f}{X_3}\begin{bmatrix} X_1 \\ X_2 \end{bmatrix}
\tag{8}
\]

where the focal length can change in time, but no prior knowledge on how it 
does so is available, one can model its evolution as a random walk 

\[
f(t+1) = f(t) + \alpha_f(t), \qquad \alpha_f(t) \sim \mathcal{N}(0, \sigma_f^2)
\tag{9}
\]


and insert it into the state of the model. As long as the overall system is 
observable, the conclusions reached in [5] will hold. The following claim shows 
that this is the case for the model above. Another imaging model has been proposed 
in the literature [2], for which similar conclusions can be drawn. The reader can 
refer to [5] for details on definitions and characterizations of observability. 
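
As a small illustration of the imaging model (8) and the random walk (9), the following sketch projects a point with a given focal length and simulates a focal-length trajectory. The names `project` and `focal_random_walk`, the seed, and the noise level are assumptions made for the example only.

```python
import random

def project(X, f):
    """Imaging model (8): pi_f(X) = f * [X1/X3, X2/X3]."""
    return [f * X[0] / X[2], f * X[1] / X[2]]

def focal_random_walk(f0, sigma_f, steps, seed=0):
    """Random walk (9): f(t+1) = f(t) + alpha_f(t), alpha_f ~ N(0, sigma_f^2)."""
    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    f = f0
    path = [f]
    for _ in range(steps):
        f = f + rng.gauss(0.0, sigma_f)
        path.append(f)
    return path
```

Setting `sigma_f` to zero recovers the constant focal length of the noise-free model used in the proposition below.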

Proposition 1 Let $g = \{T, R\}$ and $v = \{V, \omega\}$. The model 

\[
\begin{cases}
X(t+1) = X(t) & X(0) = X_0 \\
g(t+1) = e^{\widehat{v}}\,g(t) & g(0) = g_0 \\
v(t+1) = v(t) & v(0) = v_0 \\
f(t+1) = f(t) & f(0) = f_0 \\
y(t) = \pi_f\big(g(t)X(t)\big)
\end{cases}
\tag{10}
\]

is observable up to the action of the group represented by $T$, $R$, $\alpha$ acting on the 
initial conditions. 

Proof: Consider the diagonal matrix $F(t) = \mathrm{diag}\{f(t), f(t), 1\}$ and the matrix of 
scalings $A(t)$ as in the proof of proposition 1 in [5]. Consider then two initial conditions 
$\{X_1, g_1, v_1, f_1\}$ and $\{X_2, g_2, v_2, f_2\}$. For them to be indistinguishable there must exist 
matrices of scalings $A(k)$ and of focus $F(k)$ such that 

\[
\begin{cases}
g_1 X_1 = F(1)\,(g_2 X_2)\cdot A(1) \\
e^{\widehat{v}_1} e^{(k-1)\widehat{v}_1} g_1 X_1 = F(k+1)\,\big(e^{\widehat{v}_2} e^{(k-1)\widehat{v}_2} g_2 X_2\big)\cdot A(k+1), \quad k \ge 1.
\end{cases}
\tag{11}
\]

Making the representation explicit we obtain 

\[
\begin{cases}
R_1 X_1 + T_1 = F(1)\,(R_2 X_2 + T_2)\,A(1) \\
U_1 F(k) X_k A(k) + V_1 = F(k+1)\,(U_2 X_k + V_2)\,A(k+1)
\end{cases}
\tag{12}
\]

which can be re-written as 

\[
X_k A(k) A^{-1}(k+1) - F^{-1}(k) U_1^T F(k+1) U_2 X_k = F(k)^{-1} U_1^T \big(F(k+1) V_2 A(k+1) - V_1\big) A^{-1}(k+1).
\tag{13}
\]

The two sides of the equation have equal rank only if it is equal to zero, which draws us 
to conclude that $A(k)A^{-1}(k+1) = I$, and hence $A$ is constant. From $F^{-1}(k) U_1^T F(k+1) U_2 = I$ 
we get that $F(k+1)U_2 = U_1 F(k)$ and, since $U_1, U_2 \in SO(3)$, we have 
that taking the norm of both sides $2f^2(k+1) + 1 = 2f^2(k) + 1$, where $f$ must be 
positive, and therefore constant: $FU_2 = U_1 F$. From the right hand side we have that 
$FV_2 A = V_1$, from which we conclude that $A = \alpha I$, so that in vector form we have 
$V_1 = \alpha F V_2$. Therefore, from the second equation we have that, for any $f$ and any $\alpha$, 
we can have $V_1 = \alpha F V_2$, $U_1 = F U_2 F^{-1}$. However, from the first equation we have that 
$R_1 X_1 + T_1 = \alpha F R_2 X_2 + \alpha F T_2$, whence - from the general position conditions - we 
conclude that $R_1 = \alpha F R_2$ and therefore $F = I$. From that we have that $T_1 = \alpha F T_2 = \alpha T_2$, 
which concludes the proof. 

® A Matlab implementation of $\mathrm{Log}_{SO(3)}$ is included in the software distribution. 

® This $f$ is not to be confused with the generic state equation of the filter in section 3.3. 



Remark 2 The previous claim essentially implies that the realization remains 
minimal if we add into the model the focal parameter. Note that observability 
depends upon the structural properties of the model, not on the noise, which is 
therefore assumed to be zero for the purpose of the proof. 



2.2 Saturation 

Instead of eliminating states to render the model observable, it is possible to 
design a nonlinear filter directly on the (unobservable) model by saturating 
the filter along the unobservable component of the state space as we show in this 
section. In other words, it is possible to design the initial variance of the state 
of the estimator as well as its model error in such a way that it will never move 
along the unobservable component of the state space. 

As proposition 2 in [5] suggests, one can saturate the states corresponding to 
$y_0^1$, $y_0^2$, $y_0^3$ and $\rho^1$. We have to guarantee that the filter initialized at $y_0, \rho_0, g_0, v_0$ 
evolves in such a way that $y_0^1(t) = y_0^1$, $y_0^2(t) = y_0^2$, $y_0^3(t) = y_0^3$, $\rho^1(t) = \rho_0^1$. It is 
simple, albeit tedious, to prove the following proposition. 

Proposition 2 Let $P_{y^i}(0)$, $P_{\rho^1}(0)$ denote the variance of the initial condition 
corresponding to the state $y_0^i$ and $\rho^1$ respectively, and $\Sigma_{y^i}$, $\Sigma_{\rho^1}$ the variance of 
the model error corresponding to the same state, then $P_{y^i}(0) = 0$, $\Sigma_{y^i} = 0$, $i = 1 \dots 3$, 
$\Sigma_{\rho^1} = 0$ implies that $y_0^i(t|t) = y_0^i(0)$, $i = 1 \dots 3$, and $\rho^1(t|t) = \rho^1(0)$. 

2.3 Pseudo-Measurements 

Yet another alternative to render the model observable is to add pseudo-mea- 
surement equations with zero error variance. 

Proposition 3 The model 

\[
\begin{cases}
y_0^i(t+1) = y_0^i(t), \quad i = 1 \dots N & y_0^i(0) = y_0^i \\
\rho^i(t+1) = \rho^i(t), \quad i = 1 \dots N & \rho^i(0) = \rho_0^i \\
T(t+1) = \exp(\widehat{\omega}(t))\,T(t) + V(t) & T(0) = 0 \\
\Omega(t+1) = \mathrm{Log}_{SO(3)}\big(\exp(\widehat{\omega}(t))\exp(\widehat{\Omega}(t))\big) & \Omega(0) = 0 \\
V(t+1) = V(t) + \alpha_V(t) & V(0) = V_0 \\
\omega(t+1) = \omega(t) + \alpha_\omega(t) & \omega(0) = \omega_0 \\
y^i(t) = \pi\big(\exp(\widehat{\Omega}(t))\,y_0^i(t)\rho^i(t) + T(t)\big) + n^i(t), \quad i = 1 \dots N \\
\rho^1 = \psi_1 \\
y_0^i(t) = \phi^i, \quad i = 1 \dots 3
\end{cases}
\tag{14}
\]

where $\psi_1$ is an arbitrary (positive) constant and $\phi^i$ are three non-collinear points 
on the plane, is observable. 

3 Implementation: Occlusions and Drift in SFM 

The implementation of an extended Kalman filter based upon the above model is 
straightforward. However, for the sake of completeness we report it in section 3.3. 
The only issue that needs to be dealt with is the disappearing and appearing of 
feature points, a common trait of sequences of images of natural scenes. Visible 
feature-points may become occluded (and therefore their measurements become 
unavailable), or occluded points may become visible (and therefore provide fur- 
ther measurements). New states must be properly initialized. One way of doing 
so is described in section 3.1. Occlusion of point features does not cause 
major problems, unless the feature that disappears happens to be associated 
with the scale factor. This is unavoidable and results in a drift whose nature is 
explained in section 3.2. 

3.1 Occlusions 

When a feature point, say $X^i$, becomes occluded, the corresponding measure- 
ment $y^i(t)$ becomes unavailable. It is possible to model this phenomenon by 
setting the corresponding variance to infinity or, in practice, $\Sigma_{n^i} = M I_2$ for a 
suitably large scalar $M > 0$. By doing so, we guarantee that the corresponding 
states $y_0^i(t)$ and $\rho^i(t)$ are not updated: 

Proposition 4 If $\Sigma_{n^i} = \infty$, then $\hat{y}_0^i(t+1) = \hat{y}_0^i(t)$ and $\hat{\rho}^i(t+1) = \hat{\rho}^i(t)$. 

An alternative, which is actually preferable in order to avoid useless computa- 
tion and ill-conditioned inverses, is to eliminate the states $y_0^i$ and $\rho^i$ altogether, 
thereby reducing the dimension of the state-space. This is simple due to the 
diagonal structure of the model: the states $y_0^i$, $\rho^i$ are decoupled, and there- 
fore it is sufficient to remove them, and delete the corresponding rows from the 
gain matrix $K(t)$ and the variance matrix for all $t$ past the disappearance of the 
feature (see section 3.3). 
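
The two ways of handling an occluded feature can be sketched on a toy problem. The scalar update below only illustrates why a very large measurement variance freezes the estimate (the gain tends to zero); `drop_states` illustrates the preferred alternative of deleting the corresponding entries. Both function names, the scalar measurement model, and the numbers are assumptions for the example, not the paper's implementation.

```python
def scalar_update(x, P, y, R):
    """One scalar Kalman update for y = x + n, with n of variance R."""
    L = P / (P + R)                  # gain; tends to 0 as R grows
    return x + L * (y - x), (1.0 - L) * P

def drop_states(x, P, idx):
    """Remove the states listed in idx from the estimate x and covariance P."""
    gone = set(idx)
    keep = [i for i in range(len(x)) if i not in gone]
    x_red = [x[i] for i in keep]
    P_red = [[P[i][j] for j in keep] for i in keep]
    return x_red, P_red

# Occlusion modelled by a huge measurement variance M: the state barely moves.
M = 1e12
x_occ, P_occ = scalar_update(1.0, 0.5, 5.0, M)
```

Removing the states avoids carrying the nearly singular terms that the large variance introduces into the innovation covariance.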

When a new feature-point appears, on the other hand, it is not possible to 
simply insert it into the state of the model, since the initial condition is unknown. 
Any initialization error will disturb the current estimate of the remaining states, 
since it is fed back into the update equation for the filter, and generates a spurious 
transient. We address this problem by running a separate filter in parallel for 
each point using the current estimates of motion from the main filter in order to 
reconstruct the initial condition. Such a “subfilter” is based upon the following 
model, where we assume that $N_\tau$ features appear at time $\tau$: 

\[
\begin{cases}
y_\tau^i(t+1) = y_\tau^i(t) + \eta_{y^i}(t) & y_\tau^i(0) \sim \mathcal{N}(y^i(\tau), \Sigma_{n^i}) \\
\rho_\tau^i(t+1) = \rho_\tau^i(t) + \eta_{\rho^i}(t) & \rho_\tau^i(0) \sim \mathcal{N}(1, P_\rho(0)) \\
y^i(t) = \pi\Big(\exp(\widehat{\Omega}(t|t))\,\big[\exp(\widehat{\Omega}(\tau|\tau))\big]^{-1}\big[y_\tau^i(t)\rho_\tau^i(t) - T(\tau|\tau)\big] + T(t|t)\Big) + n^i(t)
\end{cases}
\qquad t > \tau, \quad i = 1 \dots N_\tau
\tag{15}
\]

where $\Omega(t|t)$ and $T(t|t)$ are the current best estimates of $\Omega$ and $T$, $\Omega(\tau|\tau)$ and 
$T(\tau|\tau)$ are the best estimates of $\Omega$ and $T$ at $t = \tau$. In practice, rather than 
initializing $\rho$ to 1, one can compute a first approximation by triangulating on 
two adjacent views, and compute covariance of the initialization error from the 
covariance of the current estimates of motion. Several heuristics can be employed 
in order to decide when the estimate of the initial condition is good enough for 
it to be inserted into the main filter. The most natural criterion is when the 
variance of the estimation error of $\rho_\tau^i$ in the subfilter is comparable with the 
variance of $\rho_0^j$ for $j \neq i$ in the main filter. The last step in order to insert the 
feature $i$ into the main filter consists in bringing the coordinates of the new 
points back to the initial frame. This is done by 

\[
X^i = \big[\exp(\widehat{\Omega}(\tau|\tau))\big]^{-1}\big[y_\tau^i\rho_\tau^i - T(\tau|\tau)\big].
\tag{16}
\]
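
The change of coordinates in (16) can be sketched as below. The sketch assumes that the image point is used in homogeneous form [y1, y2, 1], that rho is the depth along the optical axis, and that the rotation at time tau is supplied as a 3x3 matrix; the function name `to_initial_frame` is hypothetical.

```python
def to_initial_frame(y, rho, R_tau, T_tau):
    """Equation (16): X = R_tau^{-1} (y_h * rho - T_tau), with y_h = [y1, y2, 1].

    R_tau is the rotation matrix exp(hat(Omega(tau|tau))) as nested lists,
    T_tau the translation estimate at time tau.
    """
    Y = [y[0] * rho - T_tau[0], y[1] * rho - T_tau[1], rho - T_tau[2]]
    # For a rotation matrix the inverse is the transpose.
    return [sum(R_tau[k][i] * Y[k] for k in range(3)) for i in range(3)]
```

A round trip (move a known point to the camera frame at time tau, project it, then map it back) recovers the original coordinates.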



3.2 Drift 

The only case when losing a feature constitutes a problem is when it is used to 
fix the observable component of the state-space (in our notation, i = 1,2,3) as 
explained in section 2. The most obvious choice consists in associating the reference 
to any other visible point. This can be done by saturating the corresponding 
state and assigning as reference value the current best estimate. In particular, if 
feature $i$ is lost at time $\tau$, and we want to switch the reference index to feature 

^ When the scale factor is not directly associated to one feature, but is associated to 
a function of a number of features (for instance the depth of the centroid, or the 
average inverse depth), then losing any of these features causes a drift. See [5] for 
more details. 






A. Chiuso et al. 



$j$, we eliminate $y_0^i$, $\rho^i$ from the state, and set the diagonal block of $\Sigma_w$ and $P(\tau)$ 
with indices $3j - 3$ to $3j$ to zero. Therefore, by proposition 2 we have that 

\[
\hat{y}_0^j(\tau + t) = \hat{y}_0^j(\tau) \quad \forall\, t > 0.
\tag{17}
\]

If $\hat{y}_0^j(\tau)$ was equal to $y_0^j$, switching the reference feature would have no effect on 
the other states, and the filter would evolve on the same observable component 
of the state-space defined by the reference feature $i$. 

However, in general the difference $\tilde{y}_0^j(\tau) = \hat{y}_0^j(\tau) - y_0^j$ is a random variable 
with variance $\Sigma_\tau = P_{3j-3:3j-1,\,3j-3:3j-1}$. Therefore, switching the reference to 
feature $j$ causes the observable component of the state-space to move by an 
amount proportional to $\tilde{y}_0^j(\tau)$. When a number of switches have occurred, we 
can expect - on average - the state-space to move by an amount proportional 
to $\|\Sigma_\tau\|$ times the number of switches. As we discussed in section 1.2, this is unavoidable. What we 
can do is at most try to keep the bias to a minimum by switching the reference 
to the state that has the lowest variance. 

Of course, should the original reference feature i become available, one can 
immediately switch the reference to it, and therefore recover the original base 
and annihilate the bias. 
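
The rule of switching the reference to the state with the lowest variance can be sketched as follows. The sketch assumes that each feature occupies a 3x3 diagonal block of the covariance P (rows 3j-3 to 3j-1 for feature j, counted from one) and uses the trace of that block as the scalar measure of variance; both the helper names and the choice of the trace are assumptions.

```python
def block_trace(P, j, size=3):
    """Trace of the diagonal block of P belonging to feature j (1-based)."""
    start = size * (j - 1)
    return sum(P[k][k] for k in range(start, start + size))

def pick_reference(P, visible):
    """Among the visible features, return the index with the smallest variance."""
    return min(visible, key=lambda j: block_trace(P, j))
```

Other scalar summaries of the block (largest eigenvalue, determinant) would serve equally well in this sketch.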

3.3 Complete Algorithm 

The implementation of an approximate wide-sense nonlinear filter for the model 
proceeds as follows: 

Initialization Choose the initial conditions $y_0^i = y^i(0)$, $\rho_0^i = 1$, $T_0 = 0$, 
$\Omega_0 = 0$, $V_0 = 0$, $\omega_0 = 0$, $\forall\, i = 1 \dots N$. For the initial variance $P_0$, 
choose it to be block diagonal with blocks $\Sigma_{n^i}(0)$ corresponding to $y_0^i$, a large 
positive number $M$ (typically 100-1000 units of focal length) corresponding to 
$\rho^i$, zeros corresponding to $T_0$ and $\Omega_0$ (fixing the inertial frame to coincide with 
the initial reference frame). We also choose a large positive number $W$ for the 
blocks corresponding to $V_0$ and $\omega_0$. 

The variance $\Sigma_n(t)$ is usually available from the analysis of the feature 
tracking algorithm. We assume that the tracking error is independent in each 
point, and therefore $\Sigma_n$ is block diagonal. We choose each block to be the covari- 
ance of the measurement $y^i(t)$ (in the current implementation they are diagonal 
and equal to 1 pixel std.). The variance $\Sigma_w(t)$ is a design parameter that is 
available for tuning. We describe the procedure in section 3.4. Finally, set 

\[
\begin{cases}
\hat{\xi}(0|0) = \big[\,y_0^{1\,T}, \dots, y_0^{N\,T},\ \rho_0^1, \dots, \rho_0^N,\ T_0^T,\ \Omega_0^T,\ V_0^T,\ \omega_0^T\,\big]^T \\
P(0|0) = P_0.
\end{cases}
\tag{18}
\]
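
The initialization can be sketched as below. The sketch assumes that each $y_0^i$ is a 2-D image point, that the state is ordered as in the vector above (points, depths, T, Omega, V, omega), and that the default values of `sigma_n`, `M` and `W` are placeholders to be replaced by the user's own choices; the function name is hypothetical.

```python
def initial_filter_state(y0, sigma_n=1.0, M=500.0, W=100.0):
    """Build the initial state and a diagonal initial variance P0.

    y0      : list of N image points [y1, y2] measured in the first frame
    sigma_n : measurement standard deviation used for the y0 blocks
    M       : large variance for the depths rho
    W       : large variance for the velocities V and omega
    """
    N = len(y0)
    xi = [c for y in y0 for c in y]        # 2N image coordinates
    xi += [1.0] * N                         # depths rho_0^i = 1
    xi += [0.0] * 12                        # T, Omega, V, omega all zero
    diag = ([sigma_n ** 2] * (2 * N)        # blocks for y0
            + [M] * N                       # depths
            + [0.0] * 6                     # T and Omega: frame is fixed
            + [W] * 6)                      # V and omega
    n = len(xi)
    P0 = [[diag[i] if i == j else 0.0 for j in range(n)] for i in range(n)]
    return xi, P0
```

The zero variance on T and Omega is what pins the inertial frame to the initial camera frame.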



® Just to give the reader an intuitive feeling of the numbers involved, we find that 
in practice the average lifetime of a feature is around 10-30 frames depending on 
illumination and reflectance properties of the scene and motion of the camera. The 
variance of the estimation error for yj, is in the order of 10~® units of focal length, 
while the variance of p® is in the order of 10“®^ units for noise levels commonly 
encountered with commercial cameras. 
Transient During the first transient of the filter, we do not allow for new 
features to be acquired. Whenever a feature is lost, its state is removed from the 
model and its best current estimate is placed in a storage vector. If the feature 
was associated with the scale factor, we proceed as in section 3.2. The transient 
can be tested as either a threshold on the innovation, a threshold on the variance 
of the estimates, or a fixed time interval. We choose a combination with the time 
set to 30 frames, corresponding to one second of video. 

The recursion to update the state $\xi$ and the variance $P$ proceeds as follows: 
Let $f$ and $h$ denote the state and measurement model, so that the model can 
be written in concise form as 

\[
\begin{cases}
\xi(t+1) = f(\xi(t)) + w(t) & w(t) \sim \mathcal{N}(0, \Sigma_w) \\
y(t) = h(\xi(t)) + n(t) & n(t) \sim \mathcal{N}(0, \Sigma_n)
\end{cases}
\]

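We then have 

Prediction: 

\[
\begin{cases}
\hat{\xi}(t+1|t) = f(\hat{\xi}(t|t)) \\
P(t+1|t) = F(t)P(t|t)F^T(t) + \Sigma_w
\end{cases}
\tag{20}
\]

Update: 

\[
\begin{cases}
\hat{\xi}(t+1|t+1) = \hat{\xi}(t+1|t) + L(t+1)\big(y(t+1) - h(\hat{\xi}(t+1|t))\big) \\
P(t+1|t+1) = \Gamma(t+1)P(t+1|t)\Gamma^T(t+1) + L(t+1)\Sigma_n(t+1)L^T(t+1).
\end{cases}
\tag{21}
\]

Gain: 

\[
\begin{cases}
\Gamma(t+1) = I - L(t+1)H(t+1) \\
L(t+1) = P(t+1|t)H^T(t+1)\Lambda^{-1}(t+1) \\
\Lambda(t+1) = H(t+1)P(t+1|t)H^T(t+1) + \Sigma_n(t+1)
\end{cases}
\tag{22}
\]

Linearization: 

\[
\begin{cases}
F(t) = \frac{\partial f}{\partial \xi}(\hat{\xi}(t|t)) \\
H(t+1) = \frac{\partial h}{\partial \xi}(\hat{\xi}(t+1|t))
\end{cases}
\tag{23}
\]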
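
The recursion (20)-(23) can be sketched in a few lines for the special case of a scalar measurement, where the innovation variance is a number and no matrix inverse is needed. The restriction to a scalar measurement, the function names, and the calling convention (the user passes `f`, `h` and functions returning their Jacobians) are assumptions made to keep the example short; they are not the structure of the released software.

```python
def mat_mul(A, B):
    """Product of two matrices stored as nested lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def transpose(A):
    return [list(row) for row in zip(*A)]

def ekf_step(x, P, y, f, F_jac, h, H_jac, Sw, sn):
    """One EKF step for a scalar measurement y with noise variance sn."""
    n = len(x)
    # Prediction (20)
    F = F_jac(x)
    x_pred = f(x)
    FPFt = mat_mul(mat_mul(F, P), transpose(F))
    P_pred = [[FPFt[i][j] + Sw[i][j] for j in range(n)] for i in range(n)]
    # Gain (22): H is a row vector, Lambda a scalar
    H = H_jac(x_pred)
    PHt = [sum(P_pred[i][j] * H[j] for j in range(n)) for i in range(n)]
    Lam = sum(H[i] * PHt[i] for i in range(n)) + sn
    L = [v / Lam for v in PHt]
    # Update (21), with Gamma = I - L H
    innov = y - h(x_pred)
    x_new = [x_pred[i] + L[i] * innov for i in range(n)]
    G = [[(1.0 if i == j else 0.0) - L[i] * H[j] for j in range(n)]
         for i in range(n)]
    GPGt = mat_mul(mat_mul(G, P_pred), transpose(G))
    P_new = [[GPGt[i][j] + L[i] * sn * L[j] for j in range(n)]
             for i in range(n)]
    return x_new, P_new
```

The variance update in (21) keeps P symmetric and non-negative definite even when the gain is perturbed, which is why it is preferred over the shorter form (I - LH)P.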



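Let $e_i$ be the i-th canonical vector in $\mathbb{R}^3$ and define $Y^i(t) = e^{\widehat{\Omega}(t)} y_0^i(t)\rho^i(t) + T(t)$, 
$Z^i(t) = e_3^T Y^i(t)$. The i-th block-row ($i = 1, \dots, N$) $H_i(t)$ of the ma- 
trix $H(t)$ can be written as $H_i = \frac{\partial y^i}{\partial Y^i}\frac{\partial Y^i}{\partial \xi} \doteq \Pi^i \frac{\partial Y^i}{\partial \xi}$, where the time argu- 
ment $t$ has been omitted for simplicity of notation. It is easy to check that 
$\Pi^i = \frac{1}{Z^i}\,[\,I_2 \;\; -\pi(Y^i)\,]$ and 

\[
\frac{\partial Y^i}{\partial \xi} = \left[\; 0 \;\cdots\; \frac{\partial Y^i}{\partial y_0^i} \;\cdots\; 0 \;\;\Big|\;\; 0 \;\cdots\; \frac{\partial Y^i}{\partial \rho^i} \;\cdots\; 0 \;\;\Big|\;\; \frac{\partial Y^i}{\partial T} \;\;\Big|\;\; \frac{\partial Y^i}{\partial \Omega} \;\;\Big|\;\; 0 \;\;\Big|\;\; 0 \;\right].
\]

The partial derivatives in the previous expression are given by 

\[
\begin{cases}
\frac{\partial Y^i}{\partial y_0^i} = e^{\widehat{\Omega}} \begin{bmatrix} I_2 \\ 0 \end{bmatrix} \rho^i \\
\frac{\partial Y^i}{\partial \rho^i} = e^{\widehat{\Omega}}\, y_0^i \\
\frac{\partial Y^i}{\partial T} = I \\
\frac{\partial Y^i}{\partial \Omega} = \left[\, \frac{\partial e^{\widehat{\Omega}}}{\partial \Omega_1} y_0^i \rho^i \;\;\; \frac{\partial e^{\widehat{\Omega}}}{\partial \Omega_2} y_0^i \rho^i \;\;\; \frac{\partial e^{\widehat{\Omega}}}{\partial \Omega_3} y_0^i \rho^i \,\right]
\end{cases}
\]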
The linearization of the state equation involves derivatives of the logarithm 
function in SO(3), which is available as a Matlab function in the software 
distribution and will not be reported here. We shall use the following 
notation: 

\[
\frac{\partial \mathrm{Log}_{SO(3)}(R)}{\partial R} \doteq \left[\, \frac{\partial \mathrm{Log}_{SO(3)}(R)}{\partial r_{11}} \;\; \frac{\partial \mathrm{Log}_{SO(3)}(R)}{\partial r_{21}} \;\; \cdots \;\; \frac{\partial \mathrm{Log}_{SO(3)}(R)}{\partial r_{33}} \,\right]
\]

where $r_{ij}$ is the element in position $(i, j)$ of $R$. Let us denote $R = e^{\widehat{\omega}} e^{\widehat{\Omega}}$; the 
linearization of the state equation can be written in the following form: 

and the bracket (-)^ indicates that the content has been organized into a 
column vector. 



Regime Whenever a feature disappears, we simply remove it from the state 
as during the transient. However, after the transient a feature selection module 
works in parallel with the filter to select new features so as to maintain roughly 
a constant number (equal to the maximum that the hardware can handle in real 
time), and to maintain a distribution as uniform as possible across the image 
plane. We implement this by randomly sampling points on the plane, then searching 
around each point for a feature with enough brightness gradient (we use an 
SSD-type test [17]). 

Once a new point-feature is found (one with enough contrast along two inde- 
pendent directions), a new filter (which we call a “subfilter”) is initialized based 
on the model (15). Its evolution is given by 

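Initialization: 

\[
\begin{cases}
\hat{y}_\tau^i(\tau|\tau) = y^i(\tau) \\
\hat{\rho}_\tau^i(\tau|\tau) = 1 \\
P_\tau(\tau|\tau) = \begin{bmatrix} \Sigma_{n^i}(\tau) & 0 \\ 0 & M \end{bmatrix}
\end{cases}
\tag{24}
\]

Prediction: 

\[
\begin{cases}
\hat{y}_\tau^i(t+1|t) = \hat{y}_\tau^i(t|t) \\
\hat{\rho}_\tau^i(t+1|t) = \hat{\rho}_\tau^i(t|t) \\
P_\tau(t+1|t) = P_\tau(t|t) + \Sigma_w(t)
\end{cases}
\qquad t > \tau
\tag{25}
\]

Update: 

\[
\begin{bmatrix} \hat{y}_\tau^i(t+1|t+1) \\ \hat{\rho}_\tau^i(t+1|t+1) \end{bmatrix}
=
\begin{bmatrix} \hat{y}_\tau^i(t+1|t) \\ \hat{\rho}_\tau^i(t+1|t) \end{bmatrix}
+ L_\tau(t+1)\Big( y^i(t+1) - \pi\Big(\exp(\widehat{\Omega}(t+1))\,\big[\exp(\widehat{\Omega}(\tau))\big]^{-1}\big[\hat{y}_\tau^i(t+1|t)\,\hat{\rho}_\tau^i(t+1|t) - T(\tau)\big] + T(t+1)\Big)\Big)
\tag{26}
\]

and $P_\tau$ is updated according to a Riccati equation entirely analogous to (21). 

After a probation period, whose length is chosen according to the same criterion 
adopted for the main filter, the feature is inserted into the state using the trans- 
formation (16). The initial variance is chosen to be the variance of the estimation 
error of the subfilter. 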



3.4 Tuning 

The variance $\Sigma_w(t)$ is a design parameter. We choose it to be block diagonal, 
with the blocks corresponding to $T(t)$ and $\Omega(t)$ equal to zero (a deterministic 
integrator). We choose the remaining parameters using standard statistical tests, 
such as the Cumulative Periodogram of Bartlett. The idea is that the para- 
meters in $\Sigma_w$ are changed until the innovation process $\epsilon(t) \doteq y(t) - h(\hat{\xi}(t|t-1))$ is 
as close as possible to being white. The periodogram is one of many ways to 
test the “whiteness” of a stochastic process. In practice, we choose the blocks 
corresponding to $y_0^i$ equal to the variance of the measurements, and the elements 
corresponding to $\rho^i$ all equal to $\sigma_\rho$. We then choose the blocks corresponding to 
$V$ and $\omega$ to be diagonal with element $\sigma_v$, and then we change $\sigma_v$ relative to $\sigma_\rho$ 
depending on whether we want to allow for more or less regular motions. We 
then change both, relative to the variance of the measurement noise, depending 
on the level of desired smoothness in the estimates. 

Tuning nonlinear filters is an art, and this is not the proper venue to discuss 
this issue. Suffice it to say that we have only performed the procedure once and 
for all. We then keep the same tuning parameters no matter what the motion, 
structure and noise in the measurements. 
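
A whiteness check based on the cumulative periodogram can be sketched as follows for a scalar innovation sequence. For a white sequence the normalized cumulative periodogram is close to a straight line, so the largest deviation from that line serves as a score to be reduced during tuning. The function names, the direct (slow) evaluation of the discrete Fourier transform, and the use of the maximum deviation as the score are assumptions of this sketch.

```python
import cmath
import math

def cumulative_periodogram(e):
    """Normalized cumulative periodogram of a scalar sequence e."""
    n = len(e)
    mean = sum(e) / n
    z = [v - mean for v in e]
    m = (n - 1) // 2                 # positive frequencies below Nyquist
    power = []
    for k in range(1, m + 1):
        s = sum(z[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        power.append(abs(s) ** 2)
    total = sum(power)
    cum, acc = [], 0.0
    for p in power:
        acc += p
        cum.append(acc / total)
    return cum

def whiteness_deviation(e):
    """Largest gap between the cumulative periodogram and the straight line."""
    cum = cumulative_periodogram(e)
    m = len(cum)
    return max(abs(cum[k] - (k + 1) / m) for k in range(m))
```

A strongly periodic innovation concentrates its power at one frequency and yields a deviation close to one, whereas a white innovation yields a small deviation.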

4 Experiments 

The complexity of SFM makes it difficult to demonstrate the performance of 
an algorithm by means of a few plots. This is what motivated us to (a) obtain 
analytical results, which are presented in [5], and (b) make our real-time imple- 
mentation available to the public, so that the performance of the filter can be 
tested first-hand. 
In this section, for the sake of exemplification, we present a small sample 
of the performance of the filter as characterized with a few experiments on our 
real-time platform. 

4.1 Structure Error 

One of the byproducts of our algorithms is an estimate of the position of a 
number of point-features in the camera reference frame at the initial time. We use 
such estimates for a known object in order to characterize the performance of the 
filter. In particular, the distance between adjacent points on a checkerboard pattern 
(see figure 1) is known to be 2cm. We have run the filter on a sequence of 200 
frames and identified adjacent features, and plotted their distance (minus 2cm) 
in figure 1. It can be seen that the distance, despite an arbitrary initialization, 
remains well below 1mm. 




Fig. 1. (Left) A display of the real-time system. Selected features are highlighted 
by asterisks, and a virtual object (a reference frame) is placed in the scene. As the 
camera moves, the image of the virtual object is modified in real time, according to 
the estimated motion and structure of the scene, so as to make it appear stationary 
within the scene. Other displays visualize the motion of the camera relative to an 
inertial reference frame, and a bird’s eye view of the reconstructed position of the 
points tracked. (Right) Structure error: the error in mutual distance between a set 
of 20 points for which the relative position is known (the squares in the checkerboard 
box on the left) are plotted for a sequence of 200 frames. Mean and standard deviation, 
both computed across the set of points at the last frame and across the last 100 frames, 
are below one millimeter. The experiment is performed off-line, and only unoccluded 
features are considered. 



4.2 Motion Error 

Errors in motion are difficult to characterize on real sequences of images, for 
external means of estimating motion (e.g. inertial, magnetic sensors, encoders) 
are likely to be less accurate than vision. We have therefore placed a checkerboard 
box on a turntable and moved it for a few seconds, going back to its original 
position, marked with an accuracy better than 0.5mm. In figure 2 we show the 
distance between the estimated position of the camera and the initial position. 
Again, the error is below 1mm. 

Notice that in these experiments we have fixed the scale factor using the fact 
that the side of a square in the checkerboard is 2cm and we have processed the 
data off-line, so that only the unoccluded points are used. 





Fig. 2. (Left) Motion error: a checkerboard box is rotated on a turntable and then 
brought back to the initial position 10 times. We plot the distance of the estimated 
position from the initial time for the 10 trials. The ergodic mean and std are below 
one millimeter. (Right) Scale drift: during a sequence of 200 frames, the reference 
feature was switched 20 times. The mean of the shape error drifts away, but 
at a slow pace, reaching about one centimeter by the end of the sequence. 



4.3 Scale Drift 

In order to quantify the drift that occurs when the reference feature becomes 
occluded, we have generated a sequence of 200 frames and artificially switched 
the reference feature every 10 frames. The mean of the structure error is shown 
in figure 2. Despite being unavoidable, the drift is quite modest, around 1cm 
after 20 switches. 

4.4 Use of the Motion Estimates for Rendering 

The estimates of motion obtained using the algorithm we have described can 
be used in order to obtain estimates of shape. As a simple example, we have 
taken an uncalibrated sequence of images, shown in figure 3, and estimated its 
motion and focal length with the model described in section 2.1, while fixing 
the optical center at the center of the image. We have then used the estimates 
of motion to perform a dense correlation-based triangulation. The position of 
some 120,000 points, rendered with shading, is shown in figure 3 along with two 
views obtained from novel viewpoints. 

Although there is no ground truth available, the qualitative shape of the 
scene seems to have been captured. Admittedly, there are several artifacts. However, we 
would like to stress that these results have been obtained entirely automatically. 

Fig. 3. The “temple sequence” (courtesy of AIACE): one image out of a sequence 
of 46 views of an Etruscan temple (top-left): no calibration data is available. The 
motion estimated using the algorithm presented in this paper can be used to triangulate 
each pixel, thus obtaining a “dense” representation of the scene. This can be rendered 
with shading (top-right) or texture-mapped and rendered from an arbitrary viewpoint 
(bottom left and right). Although no ground truth is available and there are significant 
artifacts, the qualitative shape can be appreciated from the rendered views. 



5 Conclusions 

The causal estimation of three-dimensional structure and motion can be posed 
as a nonlinear filtering problem. In this paper we have described the implemen- 
tation of an algorithm whose global observability, uniform observability, minimal 
realization and stability have been proven in [5]. 

The filter has been implemented on a personal computer, and the imple- 
mentation has been made available to the public. The filter exhibits honest 
performance when the scene contains at least 20-40 points with high contrast, 
when the relative motion is “slow” (compared to the sampling frequency of the 
frame grabber), when the scene occupies a significant portion of the image and 
the lens aperture is “large enough” (typically more than 30° of visual field). 

While it is relatively simple to design an experiment where the implementa- 
tion fails to provide reliable estimates (changing illumination, specularities etc.), 
we believe that the algorithm we propose is close to the performance limits for 
causal, real-time algorithms to recover point-wise structure and motion. In order 
to improve the performance of motion estimates, we believe that a more “glo- 
bal” representation of the environment is needed. Using feature-points alone, we 
think this is as good as it gets. 

The next logical steps are in two directions. On one hand, to explore more 
meaningful representations of the environment as a collection of surfaces with 
certain shape emitting a certain energy distribution. On the other hand, a theo- 
retically sound treatment of nonlinear filtering for these problems involves esti- 
mation on Riemannian manifolds and homogeneous spaces. Both are open and 
challenging problems in need of meaningful solutions. 

Abstract. Background subtraction is a method typically used to segment 
moving regions in image sequences taken from a static camera 
by comparing each new frame to a model of the scene background. We 
present a novel non-parametric background model and a background 
subtraction approach. The model can handle situations where the back- 
ground of the scene is cluttered and not completely static but contains 
small motions such as tree branches and bushes. The model estimates 
the probability of observing pixel intensity values based on a sample of 
intensity values for each pixel. The model adapts quickly to changes in 
the scene which enables very sensitive detection of moving targets. We 
also show how the model can use color information to suppress detec- 
tion of shadows. The implementation of the model runs in real-time for 
both gray level and color imagery. Evaluation shows that this approach 
achieves very sensitive detection with very low false alarm rates. 

Key words: visual motion, active and real time vision, motion detection, 
non-parametric estimation, visual surveillance, shadow detection 



1 Introduction 

The detection of unusual motion is the first stage in many automated visual 
surveillance applications. It is always desirable to achieve very high sensitivity 
in the detection of moving objects with the lowest possible false alarm rates. 
Background subtraction is a method typically used to detect unusual motion in 
the scene by comparing each new frame to a model of the scene background. 

If we monitor the intensity value of a pixel over time in a completely static 
scene (i.e., with no background motion), then the pixel intensity can be reason- 
ably modeled with a Normal distribution $N(\mu, \sigma^2)$, given that the image noise over 
time can be modeled by a zero mean Normal distribution $N(0, \sigma^2)$. This Nor- 
mal distribution model for the intensity value of a pixel is the underlying model 
for many background subtraction techniques. For example, one of the simplest 
background subtraction techniques is to calculate an average image of the scene 
with no moving objects, subtract each new frame from this image, and threshold 
the result. 

This basic Normal model can adapt to slow changes in the scene (for ex- 
ample, illumination changes) by recursively updating the model using a simple 
adaptive filter. This basic adaptive model has been used in earlier work, and Kalman 
filtering has also been used for adaptation. 
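
The simplest scheme described above (average image, subtraction, threshold) and a recursive update of the background can be sketched as follows. Frames are taken to be nested lists of gray levels; the exponential update with rate `alpha` is one common form of a simple adaptive filter and is an assumption here, as are all function names and parameter values.

```python
def average_background(frames):
    """Per-pixel average of a list of frames containing no moving objects."""
    n = len(frames)
    rows, cols = len(frames[0]), len(frames[0][0])
    return [[sum(f[r][c] for f in frames) / n for c in range(cols)]
            for r in range(rows)]

def foreground_mask(frame, background, threshold):
    """Mark pixels whose absolute difference from the background is large."""
    return [[abs(p - b) > threshold for p, b in zip(frow, brow)]
            for frow, brow in zip(frame, background)]

def adapt(background, frame, alpha):
    """Recursive update B <- (1 - alpha) * B + alpha * frame (assumed form)."""
    return [[(1.0 - alpha) * b + alpha * p for p, b in zip(frow, brow)]
            for frow, brow in zip(frame, background)]
```

A small `alpha` tracks slow illumination changes while keeping briefly visible moving objects from being absorbed into the background.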

In many visual surveillance applications that work with outdoor scenes, the 
background of the scene contains many non-static objects such as tree branches 
and bushes whose movement depends on the wind in the scene. This kind of 
background motion causes the pixel intensity values to vary significantly with 
time. For example, one pixel can be the image of the sky in one frame, a tree leaf in 
another frame, a tree branch in a third frame and some mixture subsequently; in 
each situation the pixel will have a different color. 




Fig. 1. Intensity value over time




Fig. 2. Outdoor scene with a circle at the top left corner showing the location of the
sample pixel in figure 1



Figure 1 shows how the gray level of a vegetation pixel from an outdoor scene
changes over a short period of time (900 frames, 30 seconds). The scene is shown
in figure 2. Figure 3-a shows the intensity histogram for this pixel. It is clear
that the intensity distribution is multi-modal, so the Normal distribution model
for the pixel intensity/color would not hold.

In earlier work on traffic surveillance, a mixture of three Normal distributions was
used to model the pixel value. The pixel intensity was modeled as
a weighted mixture of three Normal distributions: road, shadow and vehicle
distributions. An incremental EM algorithm was used to learn and update the
parameters of the model. Although, in this case, the pixel intensity is modeled
with three distributions, the uni-modal distribution assumption is still used for
the scene background, i.e., the road distribution.

In [6,7] a generalization of the previous approach was presented. The pixel
intensity is modeled by a mixture of $K$ Gaussian distributions ($K$ is a small
number from 3 to 5) to model variations in the background like tree branch
motion and similar small motion in outdoor scenes. The probability that a certain
pixel has intensity $x_t$ at time $t$ is estimated as:

$$\Pr(x_t) = \sum_{j=1}^{K} \frac{w_j}{(2\pi)^{d/2} |\Sigma_j|^{1/2}} \exp\left(-\tfrac{1}{2} (x_t - \mu_j)^T \Sigma_j^{-1} (x_t - \mu_j)\right) \tag{1}$$

where $w_j$ is the weight, $\mu_j$ is the mean and $\Sigma_j = \sigma_j^2 I$ is the covariance for the
$j$th distribution. The $K$ distributions are ordered based on $w_j / \sigma_j^2$ and the first
$B$ distributions are used as a model of the background of the scene, where $B$ is
estimated as

$$B = \arg\min_b \left( \sum_{j=1}^{b} w_j > T \right) \tag{2}$$

The threshold $T$ is the fraction of the total weight given to the background model.
Background subtraction is performed by marking any pixel that is more than 2.5
standard deviations away from all of the $B$ distributions as a foreground pixel.
The parameters of the distributions are updated recursively using a learning rate
$\alpha$, where $1/\alpha$ controls the speed at which the model adapts to change.
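
The selection of background distributions in the mixture model can be illustrated with a short sketch. It assumes gray level (scalar) intensities and a single standard deviation per distribution; the function names and the example numbers are hypothetical, and the recursive parameter update is not shown.

```python
def background_components(weights, sigmas, T):
    """Order components by w / sigma^2 and keep the first B whose weights sum above T."""
    order = sorted(range(len(weights)),
                   key=lambda j: weights[j] / sigmas[j] ** 2,
                   reverse=True)
    chosen, total = [], 0.0
    for j in order:
        chosen.append(j)
        total += weights[j]
        if total > T:          # smallest prefix whose total weight exceeds T
            break
    return chosen


def matches_background(x, means, sigmas, chosen, k=2.5):
    """True if x lies within k standard deviations of any chosen component."""
    return any(abs(x - means[j]) <= k * sigmas[j] for j in chosen)
```

A pixel value for which `matches_background` returns `False` would be marked as foreground.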

In the case where the background has very high frequency variations, this
model fails to achieve sensitive detection. For example, the 30 second intensity
histogram, shown in figure 3-a, shows that the intensity distribution covers a very
wide range of gray levels (this would be true for color also). All these variations
occur in a very short period of time (30 seconds). Modeling the background
variations with a small number of Gaussian distributions will not be accurate.
Furthermore, the very wide background distribution will result in poor detection
because most of the gray level spectrum would be covered by the background
model.

Another important factor is how fast the background model adapts to change.
Figure 3-b shows 9 histograms of the same pixel obtained by dividing the original
time interval into nine equal-length subintervals, each containing 100 frames (3 1/3
seconds). From these partial histograms we notice that the intensity distribution
is changing dramatically over very short periods of time. Using more “short-term”
distributions will allow us to obtain better detection sensitivity.




754 A. Elgammal, D. Harwood, and L. Davis 




Fig. 3. (a) Histogram of intensity values (horizontal axis: gray value), (b) Partial histograms



We are faced with the following trade off: if the background model adapts 
too slowly to changes in the scene, then we will construct a very wide and 
inaccurate model that will have low detection sensitivity. On the other hand, 
if the model adapts too quickly, this will lead to two problems: the model may 
adapt to the targets themselves, as their speed cannot be neglected with respect 
to the background variations, and it leads to inaccurate estimation of the model 
parameters. 

Our objective is to be able to accurately model the background process
non-parametrically. The model should adapt very quickly to changes in the background
process, and detect targets with high sensitivity. In the following sections
we describe a background model that achieves these objectives. The model keeps
a sample for each pixel of the scene and estimates the probability that a newly
observed pixel value is from the background. The model estimates these
probabilities independently for each new frame. In section 2 we describe the suggested
background model and background subtraction process. A second stage of background
subtraction, discussed in section 3, aims to suppress false detections
that are due to small motions in the background not captured by the model.
Adapting to long-term changes is discussed in section 4. In section 5 we explain how
to use color to suppress shadows from being detected.

2 Basic Background Model 

2.1 Density Estimation 

In this section, we describe the basic background model and the background
subtraction process. The objective of the model is to capture very recent
information about the image sequence, continuously updating this information to
capture fast changes in the scene background. As shown in figure 3-b, the intensity
distribution of a pixel can change quickly. So we must estimate the density
function of this distribution at any moment of time given only very recent history
information if we hope to obtain sensitive detection.

Let $x_1, x_2, \ldots, x_N$ be a recent sample of intensity values for a pixel. Using this
sample, the probability density function that this pixel will have intensity value
$x_t$ at time $t$ can be non-parametrically estimated using the kernel estimator $K$
as

$$\Pr(x_t) = \frac{1}{N} \sum_{i=1}^{N} K(x_t - x_i) \tag{3}$$

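If we choose our kernel estimator function, $K$, to be a Normal function $N(0, \Sigma)$,
where $\Sigma$ represents the kernel function bandwidth, then the density can be
estimated as

$$\Pr(x_t) = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\left(-\tfrac{1}{2} (x_t - x_i)^T \Sigma^{-1} (x_t - x_i)\right) \tag{4}$$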



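If we assume independence between the different color channels, with a different
kernel bandwidth $\sigma_j^2$ for the $j$th color channel, then

$$\Sigma = \begin{pmatrix} \sigma_1^2 & 0 & 0 \\ 0 & \sigma_2^2 & 0 \\ 0 & 0 & \sigma_3^2 \end{pmatrix}$$

and the density estimation is reduced to

$$\Pr(x_t) = \frac{1}{N} \sum_{i=1}^{N} \prod_{j=1}^{d} \frac{1}{\sqrt{2\pi\sigma_j^2}} \exp\left(-\frac{1}{2} \frac{(x_{t_j} - x_{i_j})^2}{\sigma_j^2}\right) \tag{5}$$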



Using this probability estimate, the pixel is considered a foreground pixel if
$\Pr(x_t) < th$, where the threshold $th$ is a global threshold over the whole image that
can be adjusted to achieve a desired percentage of false positives. Practically,
the probability estimation of equation (5) can be calculated in a very fast way
using precalculated lookup tables for the kernel function values given the
intensity value difference, $(x_t - x_i)$, and the kernel function bandwidth. Moreover, a
partial evaluation of the sum in equation (5) is usually sufficient to surpass the
threshold at most image pixels, since most of the image is typically sampled
from the background. This allows us to construct a very fast implementation of
the probability estimation.
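
The foreground test based on equation (5), including the partial evaluation of the sum, might look like the following sketch. It assumes each sample value is a tuple with one entry per color channel and that `sigmas` holds one standard deviation per channel; the function names are hypothetical, and the precalculated lookup tables are omitted for brevity (the kernel is evaluated directly).

```python
import math


def background_probability(x_t, sample, sigmas, th=None):
    """Kernel density estimate with a product of 1-D Normal kernels per channel.

    If th is given, the sum stops as soon as it reaches th, since the pixel
    is then already known to be background.
    """
    n = len(sample)
    total = 0.0
    for x_i in sample:
        k = 1.0
        for c, (a, b) in enumerate(zip(x_t, x_i)):
            s2 = sigmas[c] ** 2
            k *= math.exp(-0.5 * (a - b) ** 2 / s2) / math.sqrt(2.0 * math.pi * s2)
        total += k / n
        if th is not None and total >= th:
            break              # partial sum already exceeds the threshold
    return total


def is_foreground(x_t, sample, sigmas, th):
    """A pixel is foreground when its estimated background probability is below th."""
    return background_probability(x_t, sample, sigmas, th) < th
```

With early termination the returned value is only a lower bound on the density, which is sufficient for the threshold test.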

Density estimation using a Normal kernel function is a generalization of the
Gaussian mixture model, where each single sample of the $N$ samples is considered
to be a Gaussian distribution $N(0, \Sigma)$ by itself. This allows us to estimate the
density function more accurately and depending only on recent information from
the sequence. This also enables the model to quickly “forget” about the past
and concentrate more on recent observations. At the same time, we avoid the
inevitable errors in parameter estimation, which typically requires large amounts
of data to be both accurate and unbiased. In section 6 we present a comparison
between the two models. We will show that if both models are given the same
amount of memory, and the parameters of the two models are adjusted to achieve
the same false positive rates, then the non-parametric model has much higher
sensitivity in detection than the mixture of $K$ Gaussians.

Fig. 4. Background subtraction. (a) Original image, (b) Estimated probability image.

Figure 4-b shows the estimated background probability, where brighter pixels
represent lower background probability pixels.

2.2 Kernel Width Estimation 

There are at least two sources of variations in a pixel’s intensity value. First, there 
are large jumps between different intensity values because different objects (sky, 
branch, leaf and mixtures when an edge passes through the pixel) are projected 
to the same pixel at different times. Second, for those very short periods of 
time when the pixel is a projection of the same object, there are local intensity 
variations due to blurring in the image. The kernel bandwidth, $\Sigma$, should reflect
the local variance in the pixel intensity due to the local variation from image blur 
and not the intensity jumps. This local variance will vary over the image and 
change over time. The local variance is also different among the color channels, 
requiring different bandwidths for each color channel in the kernel calculation. 

To estimate the kernel bandwidth $\sigma_j^2$ for the $j$th color channel for a given
pixel, we compute the median absolute deviation over the sample for consecutive
intensity values of the pixel. That is, the median, $m$, of $|x_i - x_{i+1}|$ for
each consecutive pair $(x_i, x_{i+1})$ in the sample is calculated independently for
each color channel. Since we are measuring deviations between two consecutive
intensity values, the pair $(x_i, x_{i+1})$ usually comes from the same local-in-time
distribution, and only a few pairs are expected to come from cross distributions.
If we assume that this local-in-time distribution is Normal $N(\mu, \sigma^2)$, then the
deviation $(x_i - x_{i+1})$ is Normal $N(0, 2\sigma^2)$. So the standard deviation of the first
distribution can be estimated as

$$\sigma = \frac{m}{0.68\sqrt{2}}$$

Since the deviations are integer values, linear interpolation is used to obtain
more accurate median values.
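
As an illustration, the bandwidth estimate for one color channel can be computed as below. This is a sketch only: the function name is hypothetical, and it uses the plain sample median from the standard library rather than the interpolated median mentioned above.

```python
import math
import statistics


def kernel_sigma(values):
    """Estimate sigma from the median of absolute consecutive differences.

    values: intensities of one pixel in one channel, consecutive in time.
    """
    deviations = [abs(b - a) for a, b in zip(values, values[1:])]
    m = statistics.median(deviations)
    return m / (0.68 * math.sqrt(2.0))
```

For example, the sequence 10, 12, 11, 15, 14 has absolute differences 2, 1, 4, 1 with median 1.5, so the estimate is 1.5 / (0.68 * sqrt(2)), roughly 1.56.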

3 Suppression of False Detection 

In outdoor environments with fluctuating backgrounds, there are two sources
of false detections. First, there are false detections due to random noise, which
should be homogeneous over the entire image. Second, there are false detections
due to small movements in the scene background that are not represented in the
background model. This can occur, for example, if a tree branch moves further
than it did during model generation. Also, small camera displacements due to
wind load are common in outdoor surveillance and cause many false detections.
This kind of false detection is usually spatially clustered in the image, and it is not
easy to eliminate using morphology or noise filtering because these operations
might also affect small and/or occluded targets.

The second stage of detection aims to suppress the false detections due to
small and unmodelled movements in the scene background. If some part of the
background (a tree branch, for example) moves to occupy a new pixel, but it was
not part of the model for that pixel, then it will be detected as a foreground
object. However, this object will have a high probability of being a part of the
background distribution at its original pixel. Assuming that only a small
displacement can occur between consecutive frames, we decide if a detected pixel is
caused by a background object that has moved by considering the background
distributions in a small neighborhood of the detection.

Let $x_t$ be the observed value of a pixel, $x$, detected as a foreground pixel
by the first stage of the background subtraction at time $t$. We define the pixel
displacement probability, $P_{\mathcal{N}}(x_t)$, to be the maximum probability that the
observed value, $x_t$, belongs to the background distribution of some point in the
neighborhood $\mathcal{N}(x)$ of $x$

$$P_{\mathcal{N}}(x_t) = \max_{y \in \mathcal{N}(x)} \Pr(x_t \mid B_y)$$

where $B_y$ is the background sample for pixel $y$ and the probability estimation,
$\Pr(x_t \mid B_y)$, is calculated using the kernel function estimation as in equation (5).
By thresholding $P_{\mathcal{N}}$ for detected pixels we can eliminate many false detections
due to small motions in the background. Unfortunately, we can also eliminate
some true detections by this process, since some true detected pixels might be
accidentally similar to the background of some nearby pixel. This happens more
often on gray level images. To avoid losing such true detections we add the
constraint that the whole detected foreground object must have moved from
a nearby location, and not only some of its pixels. We define the component
displacement probability, $P_C$, to be the probability that a detected connected
component $C$ has been displaced from a nearby location. This probability is
estimated by

$$P_C = \prod_{x \in C} P_{\mathcal{N}}(x)$$

For a connected component corresponding to a real target, the probability that
this component has been displaced from the background will be very small. So, a
detected pixel $x$ will be considered to be a part of the background only if
$(P_{\mathcal{N}}(x) > th_1) \wedge (P_C(x) > th_2)$.

In our implementation, a circular neighborhood of diameter 5 is used to determine
pixel displacement probabilities for pixels detected from stage one. The
threshold $th_1$ was set to be the same threshold used during the first background
subtraction stage, which was adjusted to produce a fixed false detection rate.
The threshold $th_2$ can powerfully discriminate between real moving components
and displaced ones, since the former have much lower component displacement
probabilities.
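
The two displacement probabilities and the combined decision rule can be sketched as follows. The sketch assumes background samples are stored in a dictionary keyed by (row, column), that `prob` is any callable returning the background probability of a value given a sample (for instance a kernel estimate), and that a radius of 2 pixels approximates the diameter 5 circular neighborhood; all names are hypothetical.

```python
def pixel_displacement_probability(x_t, pos, samples, prob, radius=2):
    """Maximum background probability of x_t over the neighborhood of pos."""
    r0, c0 = pos
    best = 0.0
    for dr in range(-radius, radius + 1):
        for dc in range(-radius, radius + 1):
            if dr * dr + dc * dc > radius * radius:
                continue                      # keep the neighborhood circular
            y = (r0 + dr, c0 + dc)
            if y in samples:                  # skip positions outside the image
                best = max(best, prob(x_t, samples[y]))
    return best


def component_displacement_probability(pixel_probs):
    """Product of the pixel displacement probabilities over one component."""
    p = 1.0
    for v in pixel_probs:
        p *= v
    return p


def is_displaced_background(p_n, p_c, th1, th2):
    """A detected pixel is treated as background only if both tests pass."""
    return p_n > th1 and p_c > th2
```

For large components the product can underflow in floating point; summing logarithms of the probabilities is a common alternative, not shown here.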




Fig. 5. Effect of the second stage of detection on suppressing false detections 



Figure 5 illustrates the effect of the second stage of detection. The result
after the first stage is shown in figure 5-b. In this example, the background
has not been updated for several seconds and the camera has been slightly
displaced during this time interval, so we see many false detections along high
contrast edges. Figure 5-c shows the result after suppressing detected pixels
with high displacement probability. We eliminate most of the false detections
due to displacement, and only random noise that is not correlated with the
scene remains as false detections; but some true detected pixels were also lost.
The final result of the second stage of the detection is shown in figure 5-d, where
the component displacement probability constraint was added. Figure 6-b shows
another result where, as a result of the wind load, the camera is shaking slightly,
which results in a lot of clustered false detections, especially on the edges. After
the second stage of detection, figure 6-c, most of these clustered false detections
are suppressed while the small target at the left side of the image remains.




Fig. 6. (b) Result after first stage of detection, (c) Result after second stage



4 Updating the Background 

In the previous sections it was shown how to detect foreground regions given a
recent history sample as a model of the background. This sample contains $N$
intensity values taken over a window in time of size $W$. The kernel bandwidth
estimation requires all the samples to be consecutive in time, i.e., $N = W$, or
requires sampling $N/2$ pairs of consecutive intensity values over time $W$.

This sample needs to be updated continuously to adapt to changes in the
scene. The update is performed in a first-in first-out manner. That is, the oldest
sample/pair is discarded and a new sample/pair is added to the model. The new
sample is chosen randomly from each interval of length $W/N$ frames.

Given a new pixel sample, there are two alternative mechanisms to update 
the background: 

1. Selective Update: add the new sample to the model only if it is classified as 
a background sample. 

2. Blind Update: just add the new sample to the model. 

There are tradeoffs to these two approaches. The first enhances detection of
the targets, since target pixels are not added to the model. This involves an
update decision: we have to decide if each pixel value belongs to the background
or not. The simplest way to do this is to use the detection result as an update
decision. The problem with this approach is that any incorrect detection decision
will result in persistent incorrect detection later, which is a deadlock situation.
So, for example, if a tree branch is displaced and stays fixed in the
new location for a long time, it will be continually detected.

The second approach does not suffer from this deadlock situation since it does
not involve any update decisions; it allows intensity values that do not belong
to the background to be added to the model. This leads to bad detection of the
targets (more false negatives) as they erroneously become part of the model.
This effect is reduced as we increase the time window over which the samples are
taken, as a smaller proportion of target pixels will be included in the sample.
But as we increase the time window more false positives will occur because the
adaptation to changes is slower and rare events are not as well represented in
the sample.
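
The first-in first-out sample with its two update mechanisms can be sketched for a single pixel as follows. The class and method names are hypothetical, and the random choice of one new value per interval of frames is left to the caller.

```python
from collections import deque


class PixelSample:
    """Fixed-size FIFO background sample for one pixel."""

    def __init__(self, size):
        # A bounded deque discards the oldest value automatically on append.
        self.values = deque(maxlen=size)

    def blind_update(self, x):
        """Blind update: always add the new value."""
        self.values.append(x)

    def selective_update(self, x, is_background):
        """Selective update: add the value only if it was classified as background."""
        if is_background:
            self.values.append(x)
```

In this sketch the deadlock described above corresponds to `selective_update` being called with `is_background=False` indefinitely for a pixel whose true background has changed.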

Our objective is to build a background model that adapts quickly to changes 
in the scene to support sensitive detection and low false positive rates. To achieve 
this goal we present a way to combine the results of two background models (a 
long term and a short term) in such a way to achieve better update decisions 
and avoid the tradeoffs discussed above. The two models are designed to achieve 
different objectives. First we describe the features of each model. 

Short-term model: This is a very recent model of the scene. It adapts to
changes quickly to allow very sensitive detection. This model consists of the most
recent $N$ background sample values. The sample is updated using a
selective-update mechanism, where the update decision is based on a mask $M(p, t)$, where
$M(p, t) = 1$ if the pixel $p$ should be updated at time $t$ and 0 otherwise. This
mask is derived from the final result of combining the two models.

This model is expected to have two kinds of false positives: false positives due 
to rare events that are not represented in the model, and persistent false positives 
that might result from incorrect detection/update decisions due to changes in 
the scene background. 

Long-term model: This model captures a more stable representation of the 
scene background and adapts to changes slowly. This model consists of N sample 
points taken from a much larger window in time. The sample is updated using 
a blind-update mechanism, so that every new sample is added to the model 
regardless of classification decisions. This model is expected to have more false 
positives because it is not the most recent model of the background, and more 
false negatives because target pixels might be included in the sample. This model 
adapts to changes in the scene at a slow rate based on the ratio $W/N$.

Computing the intersection of the two detection results will eliminate the
persistent false positives from the short-term model and will also eliminate
extra false positives that occur in the long-term model results. The only false
positives that remain will be rare events not represented in either model.
If such a rare event persists over time in the scene, then the long-term model will
adapt to it, and it will be suppressed from the result later.

Taking the intersection will, unfortunately, suppress true positives in the first 
model result that are false negatives in the second, because the long term model 
adapts to targets as well if they are stationary or moving slowly. To address this 
problem, all pixels detected by the short term model that are adjacent to pixels 
detected by the combination are included in the final result. 
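
One possible reading of this combination rule is sketched below for boolean detection masks stored as lists of rows. The use of 8-connected adjacency and of a single pass (rather than repeated growing) are assumptions of the sketch, and the function name is hypothetical.

```python
def combine_detections(short_mask, long_mask):
    """Intersect two detection masks, then restore adjacent short-term detections."""
    rows, cols = len(short_mask), len(short_mask[0])
    both = [[short_mask[r][c] and long_mask[r][c] for c in range(cols)]
            for r in range(rows)]
    result = [row[:] for row in both]
    for r in range(rows):
        for c in range(cols):
            if short_mask[r][c] and not both[r][c]:
                # Keep a short-term detection if any 8-neighbor is in the intersection.
                for dr in (-1, 0, 1):
                    for dc in (-1, 0, 1):
                        rr, cc = r + dr, c + dc
                        if 0 <= rr < rows and 0 <= cc < cols and both[rr][cc]:
                            result[r][c] = True
    return result
```

The resulting mask could also serve as the basis for the short-term model's update decision, with detected pixels excluded from the update.
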

5 Shadow Detection

The detection of shadows as foreground regions is a source of confusion for
subsequent phases of analysis. It is desirable to discriminate between targets
and their detected shadows. Color information is useful for suppressing shadows
from detection by separating color information from lightness information. Given
three color variables, $R$, $G$ and $B$, the chromaticity coordinates $r$, $g$ and $b$ are
$r = R/(R+G+B)$, $g = G/(R+G+B)$, $b = B/(R+G+B)$, where $r + g + b = 1$. Using the
chromaticity coordinates in detection has the advantage of being more insensitive
to small changes in illumination that are due to shadows. Figure 7 shows the
results of detection using both $(R, G, B)$ space and $(r, g)$ space; the figure shows
that using the chromaticity coordinates allows detection of the targets without
detecting their shadows. Notice that the background subtraction technique as
described in section 2 can be used with any color space.




Fig. 7. (b) Detection using (R,G,B) color space, (c) detection using chromaticity
coordinates (r,g)



Although using chromaticity coordinates helps suppress shadows, they
have the disadvantage of losing lightness information. Lightness is related to the
difference in whiteness, blackness and grayness between different objects.
For example, consider the case where the target wears a white shirt and walks
against a gray background. In this case there is no color information. Since both
white and gray have the same chromaticity coordinates, the target will not be
detected.

To address this problem we also need to use a measure of lightness at each
pixel. We use $s = R + G + B$ as a lightness measure. Consider the case where
the background is completely static, and let the expected value for a pixel be
$\langle r, g, s \rangle$. Assume that this pixel is covered by shadow in frame $t$ and let
$\langle r_t, g_t, s_t \rangle$ be the observed value for this pixel at this frame. Then, it is
expected that $\alpha \le s_t / s \le 1$. That is, it is expected that the observed value,
$s_t$, will be darker than the normal value $s$ up to a certain limit, $\alpha s \le s_t$, which
corresponds to the intuition that at most a fraction $(1 - \alpha)$ of the light coming to
this pixel can be reduced by a target shadow. A similar effect is expected for highlighted
background, where the observed value is brighter than the expected value up to
a certain limit.

In our case, where the background is not static, there is no single expected
value for each pixel. Let $A$ be the sample values representing the background for
a certain pixel, each represented as $x_i = \langle r_i, g_i, s_i \rangle$, and let $x_t = \langle r_t, g_t, s_t \rangle$
be the observed value at frame $t$. Then, we can select a subset $B \subseteq A$ of sample
values that are relevant to the observed lightness, $s_t$. By relevant we mean those
values from the sample which, if affected by shadows, can produce the observed
lightness of the pixel. That is, $B = \{x_i \mid x_i \in A \wedge \alpha \le s_t / s_i \le \beta\}$. Using this relevant
sample subset we carry out our kernel calculation, as described in section 2, based
on the 2-dimensional $(r, g)$ color space. The parameters $\alpha$ and $\beta$ are fixed over the
whole image. Figure 8 shows the detection results for an indoor scene using both the
$(R, G, B)$ color space and the $(r, g)$ color space after using the lightness variable,
$s$, to restrict the sample to relevant values only. We illustrate the algorithm on
an indoor sequence because the effects of shadows are more severe than in outdoor
environments. The target in the figure wears black pants and the background is
gray, so there is no color information. However, we still detect the target very
well and suppress the shadows.
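
The conversion to chromaticity coordinates and the selection of the relevant sample subset can be sketched as follows. The function names and the treatment of completely black pixels (lightness zero) are assumptions of the sketch; the values of alpha and beta in any use of it would be chosen by the user, as the text gives none.

```python
def to_rgs(R, G, B):
    """Convert (R, G, B) to chromaticity coordinates (r, g) plus lightness s."""
    s = R + G + B
    if s == 0:
        return (0.0, 0.0, 0)   # assumption: black pixels get zero chromaticity
    return (R / s, G / s, s)


def relevant_subset(sample, s_t, alpha, beta):
    """Keep sample values (r_i, g_i, s_i) with alpha <= s_t / s_i <= beta."""
    return [x for x in sample if x[2] > 0 and alpha <= s_t / x[2] <= beta]
```

The kernel estimate would then be computed over the returned subset using only the (r, g) components.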




Fig. 8. (b) Detection using (R,G,B) color space, (c) detection using chromaticity
coordinates (r, g) and the lightness variable s



6 Comparisons and Experimental Results 

6.1 Comparison 

In this section we describe a set of experiments performed to compare the
detection performance of the proposed background model, as described in section
2, and a mixture of Gaussians model, as described in [6,7]. We compare the ability
of the two models to detect with high sensitivity under the same false positive
rates, and also how detection rates are affected by the presence of a target in the
scene.

For the non-parametric model, a sample of size 100 was used to represent the
background; the update is performed using the detection results directly as the
update decision, as described in section 4. For the Gaussian mixture model, the
maximum number of distributions allowed at each pixel was 10 (see footnote 1). Very
few pixels reached that maximum at any point of time during the experiments. We used a
sequence containing 1500 frames taken at a rate of 30 frames/second for evaluation.
The sequence contains no moving targets. Figure 9 shows the first frame of the
sequence.




Fig. 9. Outdoor scene used in evaluation experiments 



The objective of the first experiment is to measure the sensitivity of the model
in detecting moving targets with low contrast against the background, and how this
sensitivity is affected by the target presence in the scene. To achieve this goal,
a synthetic disk target of radius 10 pixels was moved against the background of
the scene shown in figure 9. The intensity of the target is a contrast added to
the background. That is, for each scene pixel with intensity $x_t$ at time $t$ that
the target should occlude, the intensity of that pixel was changed to $x_t + \delta$. The
experiment was repeated for different values of $\delta$ in the range from 0 to 40. The
target was moved with a speed of 1 pixel/frame.

Footnote 1: This way the two models use almost the same amount of memory: for each
distribution we need 3 floating point numbers (a mean, a variance and a weight); for
each sample in our method we need 1 byte.

To set the parameters of the two models, we ran both models on the whole
sequence with no target added and set the parameters of the two models to
achieve an average false positive rate of 2%. To accomplish this for the
non-parametric model, we adjust the threshold $th$; for the Gaussian mixture model
we adjust two parameters, $T$ and $\alpha$. This was done by fixing $\alpha$ to some value and
finding the corresponding value of $T$ that gives the desired false positive rate.
This resulted in several pairs of parameters $(\alpha, T)$ that give the desired 2%
rate. The best parameters were $\alpha = 10^{-6}$, $T = 98.9\%$. If $\alpha$ is set to be greater
than $10^{-6}$, then the model adapts faster and the false negative rate is increased,
while if $\alpha$ is less than this value, then the model adapts too slowly, resulting
in more false positives and an inability to reach the desired 2% rate.
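
For the non-parametric model, one simple way to adjust the threshold to a target false positive rate is to take a quantile of the background probabilities estimated on the target-free sequence. The sketch below illustrates this idea only; it is an assumption about how such a calibration could be done, not a description of the procedure actually used in the experiments, and the function name is hypothetical.

```python
def threshold_for_false_positive_rate(probabilities, rate=0.02):
    """Pick th so that about `rate` of the target-free probabilities fall below it.

    probabilities: non-empty list of estimated background probabilities, one per
    pixel observation from a sequence known to contain no targets.
    """
    ordered = sorted(probabilities)
    k = min(len(ordered) - 1, round(rate * len(ordered)))
    return ordered[k]          # observations strictly below this are detected
```

Since every detection on a target-free sequence is a false positive, the fraction of observations below the returned value approximates the false positive rate.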

Using the adjusted parameters, both models were used to detect the
synthetic moving disk superimposed on the original sequence. Figure 10-a shows
the false negative rates obtained by the two models for various contrasts. It
can be noticed that both models have similar false negative rates for very small
contrast values, but the non-parametric model has much smaller false negative
rates as the contrast increases.




Fig. 10. (a) False negatives with moving contrast target, (b) Detection rates with global
contrast added.



The objective of the second experiment is to measure the sensitivity of the
detection without any effect of the target on the model. To achieve this, a contrast
value $\delta$ in the range -24 to +24 is added to every pixel in the image, and the
detection rates were calculated for each $\delta$ while the models were updated using
the original sequence (without the added contrast). The parameters of both
models were set as in the first experiment. For each $\delta$ value, we ran both
models on the whole sequence and the average detection rates were calculated,
where the detection rate is defined as the percentage of the image pixels (after
adding $\delta$) that are detected as foreground. Notice that with $\delta = 0$ the detection
rate corresponds to the adjusted 2% false positive rate. The detection rates
are shown in figure 10-b, where we notice better detection rates for the
non-parametric model.

From these two experiments we notice that the non-parametric model is more 
sensitive in detecting targets with low contrast against the background; moreover 
the detection using the non-parametric model is less affected by the presence of 
targets in the scene. 

6.2 Results

Video clips showing the detection results can be downloaded in either MPEG or
AVI formats from ftp://www.umiacs.umd.edu/pub/elgammal/video/index.htm.
Video clip 1 shows the detection results using 100 background samples. The video
shows the pure detection result without any morphological operations or noise
filtering. Video clip 2 shows the detection results for a color image sequence.
Figure 11-top shows a frame from this sequence. Video clip 3 shows the detection
results using both a short-term and a long-term model. The short-term model
contains the most recent 50 background samples, while the long-term model contains 50
samples taken over a 1000 frame time window. Figure 11-bottom shows a frame
from this sequence where the target is walking behind trees and is occluded by
tree branches that are moving.




Fig. 11. Example of detection results 



Video clip 4 shows the detection result for a sequence taken using an
omni-directional camera (see footnote 2). A 100 sample short-term model is used to
obtain these results on images of size 320x240. One pass of morphological closing was
performed on the results. All the results show the detection result without any
use of tracking information of the targets. Figure 12-top shows a frame from
this sequence with multiple targets in the scene. Video clip 5 shows the detection
result for an outdoor scene on a rainy day. The video shows three different clips for
different rain conditions, where the system adapted to each situation and could
detect targets with high sensitivity even under heavy rain. Figure 12-bottom
shows a frame from this sequence with a car moving under heavy rain.

Footnote 2: We would like to thank T.E. Boult, EECS Department, Lehigh University, for
providing us with this video.




Fig. 12. Top: Detection result for an omni-directional camera. Bottom: Detection result
for a rainy day.



7 Conclusion and Future Extensions 

A robust, non-parametric background model and background subtraction
mechanism that works with color imagery was introduced. The model can handle
situations where the background of the scene is not completely static but
contains small motions such as tree branch motion. The model is based on estimating
the intensity density directly from sample history values. The main feature of
the model is that it represents a very recent model of the scene and adapts to
changes quickly. A second stage of the background subtraction was presented
to suppress false detections that are due to small motions in the scene
background, based on spatial properties. We also showed how the model can use color
information to suppress shadows of the targets from being detected. A framework
was presented to combine a short-term and a long-term model to achieve
more robust detection results. A comparison between the proposed model and a
Gaussian mixture model [6,7] was also presented.

The implementation of the approach runs at 15-20 frame per second on a 
400 MHz Pentium processor for 320x240 gray scale images depending on the 
size of the background sample and the complexity of the detected foreground. 
Precalculated lookup tables for kernel function values are used to calculate the 
probability estimation of equation O in an efficient way. For most image pixels 
the evaluation of the summation in equation 0 stops after very few terms once 
the sum surpasses the threshold, which allows very fast probability estimation. 
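
The early-termination idea described above can be sketched as follows. This is only an illustration under stated assumptions: the Gaussian kernel, the integer gray-level differences used to index the table, and the names make_kernel_table and is_background are hypothetical and are not taken from the authors' implementation.

```python
import math


def make_kernel_table(sigma, max_diff=255):
    """Precompute Gaussian kernel values for every integer gray-level difference.

    Assumption: a 1-D Gaussian kernel with bandwidth sigma; the text only says
    that kernel values are looked up from precalculated tables.
    """
    norm = 1.0 / (math.sqrt(2.0 * math.pi) * sigma)
    return [norm * math.exp(-0.5 * (d / sigma) ** 2) for d in range(max_diff + 1)]


def is_background(value, samples, table, threshold):
    """Return (decision, number_of_terms_evaluated).

    The density estimate is (1/N) * sum_i K(value - sample_i). Comparing it with
    the threshold is the same as comparing the running sum with threshold * N,
    so the loop can stop as soon as that target is exceeded.
    """
    target = threshold * len(samples)
    total = 0.0
    for terms, sample in enumerate(samples, start=1):
        total += table[abs(value - sample)]
        if total > target:
            return True, terms  # enough evidence for background; stop early
    return False, len(samples)  # sum never reached the target: foreground
```

For a pixel whose history is close to its current value, only a handful of terms are needed before the sum exceeds the target, which is the behaviour the paragraph above attributes to most image pixels.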

As for future extensions, we are trying to build a more concise representation 
for the long-term model of the scene by estimating the required sample size 
for each pixel in the scene depending on the variations at this pixel. So, using 
the same total amount of memory, we can achieve better results by assigning 
more memory to unstable points and less memory to stable points. Preliminary 
experiments show that we can reach a compression of 80-90% and still achieve 
the same sensitivity in detection. 
Abstract. This paper presents an approach to representing and analyzing 
spatiotemporal information in support of making qualitative, yet 
semantically meaningful distinctions at the earliest stages of processing. 
A small set of primitive classes of spatiotemporal structure are proposed 
that correspond to categories of stationary, coherently moving, incohe- 
rently moving, flickering, scintillating and “too unstructured to support 
further inference” . It is shown how these classes can be represented and 
distinguished in a uniform fashion in terms of oriented energy signatu- 
res. Further, empirical results are presented that illustrate the use of 
the approach in application to natural imagery. The importance of the 
described work is twofold: (i) From a theoretical point of view a se- 
mantically meaningful decomposition of spatiotemporal information is 
developed, (ii) From a practical point of view, the developed approach 
has the potential to impact real world image understanding and analy- 
sis applications. As examples: The approach could be used to support 
early focus of attention and cueing mechanisms that guide subsequent 
activities by an intelligent agent; the approach could provide the repre- 
sentational substrate for indexing video and other spatiotemporal data. 



1 Introduction 

1.1 Motivation 

When confronted with spatiotemporal data, an intelligent system that must 
make sense of the ensuing stream can be overwhelmed by its sheer quantity. 
Video and other temporal sequences of images are notorious for the vast amount 
of raw data that they comprise. An initial organization which affords distinc- 
tions that can guide subsequent processing would be a key enabler for dealing 
efficiently with data of this nature. 

The current paper explores the possibility of performing qualitative analyses 
of spatiotemporal patterns that capture salient and meaningful categories of 
structure and which are easily recovered from raw data. These categories capture 
distinctions along the following lines: What is moving and what is stationary? 
Are the moving objects moving in a coherent fashion? Which portions of the data 
are best described as scintillating and which portions are simply too unstructured 
to support subsequent analysis? More generally, given a spatiotemporal region of 
interest, one may seek to decompose it into a combination of such components. 
Significantly, it is shown that all of these distinctions can be based on a unified 
representation of spatiotemporal information in terms of local (spatiotemporal) 
correlation structure. 

The ability to parse a stream of spatiotemporal data into primitive, yet se- 
mantically meaningful, categories at an early stage of analysis can benefit sub- 
sequent processing in a number of ways. A parsing of this type could support 
cueing and focus of attention for subsequent analysis. Limited computational re- 
sources could thereby be focused on portions of the input data that will support 
the desired analysis. For example, areas that are too unstructured to support 
detailed analysis could be quickly discarded. Similarly, appropriate models to 
impose during subsequent analysis (such as for model-based motion estimation) 
could be selected and initialized. Further, the underlying representation could 
provide the basis of descriptors to support the indexing of video or other spa- 
tiotemporal data. The relative distribution of a spatiotemporal region’s total 
energy across the defined primitives might serve as a characteristic signature 
for initial database construction as well as subsequent look-up. Also, in certain 
circumstances the proposed analysis could serve directly to guide intelligent ac- 
tion relative to the impinging environment. Certain primitive reactive behaviors 
(say, pursuit or flight) might be triggered by the presence of certain patterns of 
spatiotemporal structure (say, patterns indicative of large moving regions). As 
a step toward such applications, this paper presents an approach to qualitative 
spatiotemporal analysis and illustrates its representational power relative to a 
variety of natural image sequences. 

1.2 Related Research 

Previous efforts that have attempted to abstract qualitative descriptors of mo- 
tion information are of relevance to the research described in the current paper. 
Much of this work is motivated by observations suggesting the inherent difficulty 
of dealing with the visual motion field in a quantitative fashion as well as 
the general efficacy of using motion in a qualitative fashion to solve useful tasks 
(e.g., boundary and collision detection) |2E1. It should be noted, however, that 
the focus of most of this work is the qualitative interpretation of visual motion or 
optical flow while the current paper is about the analysis of spatiotemporal struc- 
ture. The level of processing discussed here precedes that at which actual motion 
computation is likely to occur. Indeed, one possible use of low-level spatiotempo- 
ral structure information might be to determine where optical flow computation 
makes sense to perform. 

Recent advances in the use of parameterized models characterizing motion 
information in terms of its projection onto a set of basis flows are also of interest. 
Some of this work makes use of principal components analysis to build the basis 
flows from training data with estimation for new data based on searching the 
space of admissible parameters |S| . Other work has defined steerable basis flows 
for simple events (e.g., motion of occluding edge or bar) with subsequent ability 
to both detect and estimate weights for a novel data set. As a whole, this 
body of research is similar to the previously reviewed qualitative motion analysis 
literature in being aimed at higher-level interpretation. 

Most closely related to the current work is prior research that has appro- 
ached motion information as a matter for temporal texture analysis [l /j . This 
research is similar in its attempt to map spatiotemporal data to primitive, yet 
meaningful patterns. However, it differs in significant ways: Its analysis is based 
on statistics (e.g., means and variances) defined over normal flow recovered from 
image sequence intensity data; whereas, the current work operates directly on 
the intensity data. Further, the patterns that it abstracts to (e.g., flowing wa- 
ter, fluttering leaves) are more specific and narrowly defined than those of the 
current work. 

A large body of research has been concerned with effecting the recovery of 
image motion (e.g., optical flow) on the basis of filters that are tuned for local 
spatiotemporal orientation |ll?Sll 1 1 r.^1 1 . Filter implementations that have 
been employed to recover estimates of spatiotemporal orientation include angularly 
tuned Gabor, lognormal and derivative of Gaussian filters. Also of relevance 
is the notion of opponency between filters that are tuned for different directions 
of motion mm- An essential motivation for taking such an operation into 
account is the close correspondence between the difference in the response of 
filters tuned to opposite directions of motion (e.g., leftward vs. rightward) and 
optical flow along the same dimension (e.g., horizontal). While the current work 
builds directly on methods for recovering local estimates of spatiotemporal ori- 
entation, it then takes a different direction in moving directly to qualitative 
characterization of structure rather than the computation of optical flow. 

Previous work also has been concerned with various ways of characterizing 
local estimates of spatiotemporal orientation. One prominent set of results along 
these lines has to do with an eigenvalue analysis of the local orientation ten- 
sor mM- Here the essential point is to characterize the dimensionality of the 
local orientation as being isotropic, line- or plane-like in order to characterize 
the local spatial structure with respect to motion analysis (e.g., distributed vs. 
oriented spatial structure with uniform motion). Other work of interest along 
these lines includes interpretation of opponent motion operators as indicative 
of motion salience m and the exploitation of multiscale analysis of temporal 
change information for detection and tracking purposes . Overall, while these 
lines of investigation are similar to the subject of the current paper, none of 
this work has proposed and demonstrated the particular and complete set of 
spatiotemporal abstractions that are the main subject of the current paper. 

In the light of previous research, the main contribution of the current paper 
is that it shows how to abstract from spatiotemporal data a number of quali- 
tative structural descriptions corresponding to semantically meaningful distinc- 
tions (e.g., what is stationary, what is moving, is the exhibited motion coherent 
or not, etc.). Further, a formulation is set forth that captures all of the distin- 
guished properties of spatiotemporal structure in a unified fashion. 
2 Technical Approach 

In this section, the proposed approach to spatiotemporal analysis is presented, 
accompanied by natural image examples. For the purposes of exposition, the 
presentation begins by restricting consideration to one spatial dimension plus 
time. Subsequently, the analysis is generalized to encompass an additional spatial 
dimension and issues involving spatiotemporal boundaries. 



2.1 Analysis in One Spatial Dimension Plus Time 




Fig. 1. Primitive Spatiotemporal Patterns. The top row of images depicts prototypical 
patterns that comprise the proposed qualitative categorization of spatiotemporal structure. 
For display purposes the images are shown for a single spatial dimension, x, plus 
time, t. The second row of plots shows the corresponding frequency domain structure, 
with axes fx and ft. As suggested by their individual titles, the categories have semantically 
meaningful interpretations. The lower part of the figure shows the predicted 
distribution of energy for each pattern as it is brought under the proposed oriented 
energy representation. The representation consists of four energy image components, 
|R - L|, R + L, Sx and Fx, that are derived from an input image via application of a 
bank of oriented filters. For the purpose of qualitative analysis the amount of energy 
that is contributed by the underlying filter responses, R, L, Sx and Fx, is taken as 
having one of three values: (approximately) zero, moderate and large, symbolized as 
0, + and ++, respectively. 

Primitive spatiotemporal patterns The local orientation (or lack thereof) 
of a pattern is one of its most salient characteristics. From a purely geometric 
point of view, orientation captures the local first-order correlation structure of 
a pattern. In the realm of image analysis, local spatiotemporal orientation often 
can be interpreted in a fashion that has additional ramifications. For example, 
image velocity is manifest as orientation in space-time m. We now explore the 
significance of this structure in one spatial dimension, the horizontal image axis, 
x, and time, t. Fig. 1 shows x-t slices of several prototypical spatiotemporal 
patterns that are of particular interest. 

Perhaps the simplest situation that might hold is that a region is essentially 
devoid of structure, i.e., image intensity is approximately constant or slowly vary- 
ing in both the spatial and temporal directions. In the spatiotemporal frequency 
domain, such a pattern would have the majority of its energy concentrated at the 
origin. When such regions occur where local contrast is small they can indicate 
an underlying smoothness in the material that is being imaged. For subsequent 
processing operations it is important to flag such areas as lacking enough in- 
formation to support stable estimates of certain image properties. For example, 
image registration can be led astray by blindly attempting to align structureless 
regions. This category will be referred to as “unstructured”. 

Locally oriented structures are quite common in spatiotemporal data. Here, 
there are several situations that are useful to distinguish. From a semantic point 
of view, it is of particular interest to categorize the patterns according to the di- 
rection of their dominant orientation. One case of interest is that which arises for 
the case of (textured) stationary objects. These cases show elongated structure 
in the spatiotemporal domain that is parallel to the temporal axis, i.e., features 
exhibit no shift in position with the passage of time. In the frequency domain, 
their energy will be concentrated along the spatial frequency axis. This case will 
be referred to as “static”. A second case of interest is that of homogeneous spatial 
structure, but with change in intensity over time (for example, overall change 
in brightness due to temporal variation in illumination) . Here, the spatiotempo- 
ral pattern will be oriented parallel to the spatial axis. Correspondingly, in the 
frequency domain the energy will be concentrated along the temporal frequency 
axis. This case will be referred to as “flicker”. A third case of interest is that of 
objects that are in motion. As noted above, such objects trace a trajectory that 
is slanted in the spatiotemporal domain in proportion to their velocity. Their 
energy in the frequency domain also exhibits a slant corresponding to their ha- 
ving both spatial and temporal variation. Such simple motion that is (at least 
locally) manifest by a single dominant orientation will be referred to as “coherent 
motion”. Finally, it is useful to distinguish a special case of oriented structure, 
that of multiple local orientations intermixed or superimposed within a spatial 
region. In this regard, there is motivation to concentrate on the case of two struc- 
tures both indicative of motion. Such a configuration has perceptual significance 
corresponding to oscillatory motion, shear and occlusion boundaries, and other 
complex motion phenomena that might be generally thought of as dynamic lo- 
cal contrast variation with motion. Interestingly, it appears that human vision 
represents this category as a special case as suggested by the perception of coun- 
terphase flicker pj. In the frequency domain the energy distribution will be the 
sum of the distributions that are implied by the component motions. This case 
will be referred to as “incoherent motion”. In comparison, there does not seem 
to be anything significant about something that is both static and flickering, 
beyond its decomposition into those primitives. 
The final broad class of spatiotemporal pattern to be considered is that of 
isotropic structure. In this case, no discernible orientations dominate the local 
region; nevertheless, there is significant spatiotemporal contrast. The frequency 
domain manifestation of the pattern also lacks a characteristic orientation, and 
is likewise isotropic. Situations that can give rise to this type of structure are 
characteristically stochastic or chaotic in nature. Natural examples include turbulence 
and the glint of specularities on water. Owing to the perceptual manifestation 
of these phenomena, this case will be referred to as “scintillation”. 
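
The frequency-domain statements made for these prototypes can be checked numerically with a small NumPy sketch. The synthetic patterns (pure sinusoids, white noise for scintillation), the [t, x] array layout, and the function names are assumptions chosen for illustration; they are not the imagery used in the paper.

```python
import numpy as np


def make_patterns(n=64, cycles=8, seed=0):
    """Synthetic x-t prototypes; arrays are indexed [t, x] (an assumed layout)."""
    k = 2.0 * np.pi * cycles / n  # whole number of cycles, so patterns are periodic
    t, x = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
    rng = np.random.default_rng(seed)
    return {
        "unstructured": np.zeros((n, n)),                     # constant intensity
        "static": np.sin(k * x),                               # no change over time
        "flicker": np.sin(k * t),                              # no spatial structure
        "coherent": np.sin(k * (x - t)),                       # single motion
        "incoherent": np.sin(k * (x - t)) + np.sin(k * (x + t)),  # two motions
        "scintillation": rng.standard_normal((n, n)),          # isotropic noise
    }


def axis_energy_fractions(img):
    """Fraction of non-DC spectral power on the fx axis (ft = 0) and on the ft axis (fx = 0)."""
    power = np.abs(np.fft.fft2(img)) ** 2  # axis 0 is ft, axis 1 is fx
    power[0, 0] = 0.0                      # ignore the mean level
    total = power.sum()
    if total == 0:
        return 0.0, 0.0
    return power[0, :].sum() / total, power[:, 0].sum() / total
```

Under these assumptions the static pattern places essentially all of its power on the spatial frequency axis, the flicker pattern on the temporal frequency axis, and the coherent-motion pattern on neither, in line with the descriptions above.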

The essence of the proposed approach is to analyze any given sample of 
spatiotemporal data as being decomposed along the dimensions of the adduced 
categories: unstructured, static, flicker, coherent motion, incoherent motion and 
scintillation. While it is possible to make finer distinctions (e.g., exactly what 
the numerical value of the space-time orientation is), at the level of qualitative 
semantics these are fundamental distinctions to be made: Is something structured 
or not? If it is structured, does it exhibit a characteristic orientation or is it more 
isotropic and thereby scintillating in nature? Are oriented patterns indicative of 
something that is stationary, flickering or moving? Is the motion coherent or 
incoherent? It should be noted that each of the descriptions identified above 
is attached to the visual signal within a specified spatiotemporal region. The 
choice of this region generally affects the description assigned. For example, the 
motion of leaves in the wind may be coherent if analyzed over a very small 
area and time but incoherent over a larger area or time. An alternative way 
to think about the proposed decomposition is to consider it from the point of 
view of signal processing: In particular, what sort of decomposition (e.g., in the 
frequency domain) does it imply. This topic is dealt with in the next section in 
terms of a representation that captures the proposed distinctions. 



Oriented energy representation Given that the concern is to analyze spatio- 
temporal data according to its local orientation structure, a representation that 
is based on oriented energy is appropriate. Such a representation entails a filter 
set that divides the spatiotemporal signal into a set of oriented energy bands. 
In general, the size and shape of the filter spectra will determine the way that 
the spatiotemporal frequency domain is covered. In the present case, a family of 
relatively broadly tuned filters is appropriate due to the interest in qualitative analysis. 
The idea is to choose a spatial frequency band of interest with attendant 
low pass filtering in the temporal domain. This captures orientation orthogonal 
to the spatial axis. On the basis of this choice, a temporal frequency band can be 
specified based on the range of dynamic phenomena that are of interest for the 
given spatial band. This captures structure that is oriented in directions indica- 
tive of motion, e.g., a spatiotemporal diagonal. Finally, these characteristics can 
be complemented by considering just the temporal frequency band while spatial 
frequency is covered with a low-pass response. This captures structure that is 
oriented orthogonal to the temporal axis. Thus, it is possible to represent several 
principal directions in the spatiotemporal domain while systematically covering 
the frequency domain. The simplification realized by analyzing spatiotemporal 







R.P. Wildes and J.R. Bergen 




Fig. 2. Oriented Energy Filters for Spatiotemporal Analysis. The top row shows syn- 
thesized profiles for second derivative of Gaussian filters oriented to capture static, 
flicker, rightward and leftward motion structure (left to right). The last plot is the 
Hilbert transform of the leftward motion filter. (These plots are shown greatly enlar- 
ged for clarity). The bottom row indicates the frequency response of the corresponding 
quadrature pair filters via application of an energy calculation to the zone plate at the 
far right. The proposed approach to representing spatiotemporal structure builds on 
such filtering operations. 

structure in a two dimensional representation (i.e. one spatial and one temporal 
dimension) requires somehow addressing the remaining spatial dimension since 
the input data consists of a three dimensional volume. This is done by lowpass 
filtering the data in the orthogonal spatial direction using the 5-tap binomial 
filter [1 4 6 4 1]/16. This filtering allows for analysis of the other spatiotempo- 
ral plane (i.e. that containing the orthogonal spatial dimension) in an exactly 
analogous manner. 
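
A minimal sketch of this low-pass step in plain Python follows. The 5-tap binomial kernel [1 4 6 4 1]/16 is taken from the text; the nested-list layout volume[t][y][x], the zero padding at the borders, and the function names are assumptions made for illustration.

```python
BINOMIAL_5 = [1 / 16, 4 / 16, 6 / 16, 4 / 16, 1 / 16]


def lowpass_1d(signal):
    """Same-length convolution with the binomial kernel (zero padding assumed)."""
    n = len(signal)
    out = []
    for i in range(n):
        acc = 0.0
        for k, w in enumerate(BINOMIAL_5):
            j = i + k - 2  # kernel is centred on sample i
            if 0 <= j < n:
                acc += w * signal[j]
        out.append(acc)
    return out


def lowpass_along_y(volume):
    """Filter a volume[t][y][x] along y, the direction orthogonal to the x-t plane."""
    result = []
    for frame in volume:
        columns = list(zip(*frame))                       # columns[x] runs over y
        filtered = [lowpass_1d(list(col)) for col in columns]
        result.append([list(row) for row in zip(*filtered)])  # back to [y][x]
    return result
```

After this step each x-t slice of the volume can be analyzed as a two dimensional image; filtering along x instead prepares the y-t slices in the same way.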

In the remainder of this section a choice of filters is presented for a given 
frequency response, i.e., scale of spatial structure. 

The desired filtering can be implemented in terms of second derivative of 
Gaussian filters, $G_{2_\theta}$, at orientation $\theta$ (and their Hilbert transforms, $H_{2_\theta}$) )l 4j . 
The motivation for this choice is twofold. First, while selective for orientation, 
the tuning of these filters is moderately broad and therefore well suited to the 
sort of qualitative analysis that is the focus of the current research. Second, 
they admit a steerable and separable implementation that leads to compact and 
efficient computation. The filters are taken in quadrature (i.e., for any given $\theta$, 
$G_{2_\theta}$ and $H_{2_\theta}$ in tandem) to eliminate phase variation by producing a measure 
of local energy, $E_\theta(x,t)$, within a frequency band, according to 

$$E_\theta(x,t) = \left(G_{2_\theta}(x,t) * I(x,t)\right)^2 + \left(H_{2_\theta}(x,t) * I(x,t)\right)^2. \qquad (1)$$ 
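
The quadrature energy computation can be illustrated with the following NumPy sketch. Several choices here are assumptions and not the authors' implementation: the G2 kernel is sampled directly from (4u^2 - 2)exp(-(x^2 + t^2)), its Hilbert partner is obtained numerically in the frequency domain instead of through a separable steerable approximation, convolution is circular (via the FFT), arrays are square and indexed [t, x], and theta is the angle, measured from the x axis in the x-t plane, of the direction of differentiation.

```python
import numpy as np


def oriented_energy(image, theta, scale=3.0):
    """Quadrature-pair energy of a square [t, x] image at orientation theta."""
    n = image.shape[0]
    assert image.shape == (n, n), "sketch assumes a square image"

    # Sample the second derivative of a Gaussian taken along direction theta.
    c = (np.arange(n) - n // 2) / scale
    t, x = np.meshgrid(c, c, indexing="ij")
    u = x * np.cos(theta) + t * np.sin(theta)
    g2 = (4.0 * u**2 - 2.0) * np.exp(-(x**2 + t**2))
    G = np.fft.fft2(np.fft.ifftshift(g2))  # kernel centre moved to index (0, 0)

    # Directional Hilbert transform: multiply by -i * sign(frequency along theta).
    f = np.fft.fftfreq(n)
    proj = f[None, :] * np.cos(theta) + f[:, None] * np.sin(theta)
    H = -1j * np.sign(proj) * G

    F = np.fft.fft2(image)
    even = np.fft.ifft2(F * G).real  # response of the G2 filter
    odd = np.fft.ifft2(F * H).real   # response of its quadrature partner
    return even**2 + odd**2
```

For a sinusoidal grating the two responses are 90 degrees out of phase, so their squared sum is constant across the image and does not depend on the phase of the grating, which is the sense in which the energy measure eliminates phase variation.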

In particular, to capture the principal orientations that were suggested above, 
filtering is applied (i) oriented orthogonally to the spatial axis, (ii) orthogonally 
to the temporal axis and (iii, iv) along the two spatiotemporal diagonals, see 
Fig. 2. Notice that the frequency response plots show how the filters sweep out 
an annulus in that domain; this observation can provide the basis for allowing 
a multiscale extension to systematically alter the inner and outer rings of the 
annulus to effectively cover the frequency domain. Finally, note that at a given 
frequency the value of any one oriented energy measure is a function of both 
orientation and contrast and therefore rather ambiguous. To avoid this confound 
and get a purer measure of orientation the response of each filter should be 
normalized by the sum of the consort, i.e., 

$$\hat{E}_{\theta_i}(x,t) = \frac{E_{\theta_i}(x,t)}{\sum_j E_{\theta_j}(x,t) + \epsilon}, \qquad (2)$$ 

where $\epsilon$ is a small bias to prevent instabilities when overall energy is small. 
(Empirically we set this bias to about 1% of the maximum (expected) energy.) 
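
The normalization can be sketched in a few lines of plain Python. The dictionary-of-lists layout (one list of per-pixel energies per orientation band) and the function name are assumptions for illustration; the bias is passed in explicitly so that the caller can set it to about 1% of the maximum expected energy.

```python
def normalize_energies(energies, eps):
    """Divide each band by the per-pixel sum over all bands plus a small bias.

    energies: dict mapping a band name to a list of per-pixel energies,
              all lists having the same length (assumed layout).
    eps:      small positive bias that keeps the ratio stable when the
              overall energy is close to zero.
    """
    names = list(energies)
    n = len(energies[names[0]])
    totals = [sum(energies[name][i] for name in names) + eps for i in range(n)]
    return {name: [energies[name][i] / totals[i] for i in range(n)] for name in names}
```

Because of the bias, the normalized values at a pixel sum to slightly less than one, and a pixel with no energy in any band maps to zero in every band.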

The necessary operations have been implemented in terms of a steerable filter 
architecture pmsi. The essential idea here is to convolve an image of interest 
with a set of n basis filters, with n = 3 for the second derivative of Gaussians 
of concern. Subsequently the basis filtered images are combined according 
to interpolation formulas to yield images filtered at any desired orientation, 
$\theta$. Processing with the corresponding Hilbert transforms is accomplished in an 
analogous fashion, with n = 4. To remove high frequency components that are 
introduced by the squaring operation in forming the energy measurement (1), 
the previously introduced 5-tap binomial low-pass filter is applied to the result, 
$E_\theta$. Details of the filter implementation (e.g., specification of the basis filters and 
the interpolation formulas) are provided in piUllOj . 
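
The steering idea for the second derivative of Gaussian case can be illustrated as follows. The sketch uses the unnormalized Gaussian exp(-(x^2 + t^2)) and derives the three basis filters and interpolation weights directly from the chain rule; the normalization constants and sign conventions of the cited filter tables may differ, and the function names are hypothetical.

```python
import math


def g2_basis(x, t):
    """Three basis filters: the second partial derivatives of exp(-(x^2 + t^2))."""
    g = math.exp(-(x * x + t * t))
    gxx = (4.0 * x * x - 2.0) * g
    gxt = 4.0 * x * t * g
    gtt = (4.0 * t * t - 2.0) * g
    return gxx, gxt, gtt


def steering_weights(theta):
    """Interpolation weights for the second derivative along direction theta."""
    c, s = math.cos(theta), math.sin(theta)
    return c * c, 2.0 * c * s, s * s


def steer_g2(basis, theta):
    """Combine the basis values (or basis-filtered images) for orientation theta."""
    ka, kb, kc = steering_weights(theta)
    gxx, gxt, gtt = basis
    return ka * gxx + kb * gxt + kc * gtt


def direct_g2(x, t, theta):
    """Reference value: differentiate twice along theta without steering."""
    u = x * math.cos(theta) + t * math.sin(theta)
    return (4.0 * u * u - 2.0) * math.exp(-(x * x + t * t))
```

Since convolution is linear, the same three weights applied to the three basis-filtered images give the image filtered at orientation theta, so no additional convolutions are needed when the orientation changes.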

The final oriented energy representation that is proposed is based directly on 
the basic filtering operations that have been described. Indeed, given the class 
of primitive spatiotemporal patterns that are to be distinguished, one might 
imagine simply making use of the relative distribution of (normalized) energies 
across the four proposed orientation tuned bands as the desired representation. 
In this regard, it is proposed to make use of two of these bands directly. In particular, 
the result of filtering an input image with the filter oriented orthogonally 
to the spatial axis will be one component of the representation, let it be called 
the “Sx-image” (for static). Second, let the result of filtering an input image 
with the filter oriented orthogonally to the temporal axis be the second component 
of the representation and call it the “Fx-image” (for flicker). Due to their 
characteristic highlighting of particular orientations, these (filtered) images are 
well suited to capturing the essential nature of the patterns for which they are 
named. 

The information provided individually by the remaining two bands is ambi- 
guous with respect to the desired distinctions between, e.g., coherent and inco- 
herent motion. This state of affairs can be remedied by representing these bands 
as summed and differenced (i.e., opponent) combinations. Thus, let R - L and 
R + L stand for opponent and summed images (resp.) formed by taking the 
pointwise arithmetic difference and sum of the images that result from filtering 
an input image with the energy filters oriented along the two diagonals. It can 
be shown that the opponent image (when appropriately weighted for contrast) 
is proportional to image velocity [Q and has a strong signal in areas of coherent 
motion. It is for this reason that the notation R and L is chosen to underline the 
relationship to rightward and leftward motion. For present purposes the absolute 
value of the opponent signal, |R - L|, will be taken as the third component of 
the proposed representation since this allows for coherency always to be positive. 
Finally, the fourth component of the representation is the summed (motion) 
energy R + L. This image is of importance as it captures energy distributions 
that contain multiple orientations that are individually indicative of motion and 
is therefore of importance in dealing with incoherent motion phenomena. 
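
Assembling the four components from the normalized band energies is a pointwise operation, sketched below in plain Python. The per-pixel list layout and the name four_components are assumptions for illustration.

```python
def four_components(sx, fx, r, l):
    """Build the |R - L|, R + L, Sx, Fx representation from four normalized bands.

    Each argument is a list of per-pixel normalized energies (assumed layout):
    sx and fx come from the static and flicker bands, r and l from the two
    diagonal (rightward and leftward motion) bands.
    """
    return {
        "|R-L|": [abs(a - b) for a, b in zip(r, l)],  # opponent energy, always >= 0
        "R+L": [a + b for a, b in zip(r, l)],         # summed motion energy
        "Sx": list(sx),
        "Fx": list(fx),
    }
```

A pixel with equal rightward and leftward energy contributes to R + L but not to |R - L|, which is what separates incoherent from coherent motion in this representation.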

At this point it is interesting to revisit the primitive spatiotemporal pat- 
terns of interest and see how they project onto the four component oriented 
energy representation comprised of Sx, Fx, |R - L| and R + L, see Fig. 1. In 
the unstructured case, it is expected that all of the derived images will contain 
vanishingly small amounts of energy. Notice that for this to be true and stable, 
the presence of the bias factor, $\epsilon$, in the normalization process is important in 
avoiding division by a very small factor. For the static case, not surprisingly 
the Sx-image contains the greatest amount of energy, although there also is a 
moderate energy from the R + L-image as the underlying R and L responses will 
be present due to the operative orientation tuning. In contrast, these responses 
will very nearly cancel to leave the |R - L|-image essentially zero. Similarly, 
the orthogonal Fx-image should have essentially no intensity. The flicker case 
is similar to the static case, with the Sx and Fx-images changing roles. For the 
case of coherent motion, it is expected that the |R - L|-image will have a large 
amount of energy present. Indeed, this is the only pattern where this image is 
expected to contain any significant energy. The R + L-image also should show an 
appreciable response, with the other images showing more moderate responses. 
For the case of incoherent motion, the R + L-image should dominate as both 
the underlying R and L responses should be appreciable. Again, due to finite 
bandwidth tuning the S and F images also should show moderate responses. 
Once again the |R - L|-image should be very nearly zero. Finally, for the case 
of scintillation the S and F images should show modest, yet still appreciable 
responses. The R + L-image response should be somewhat larger, perhaps by 
a factor of two as each of the modest R and L responses sum together. Essentially 
no response is expected from the |R - L|-image. Significantly, when one 
compares all of the signatures, each is expected to be distinct from the others, 
at least for the idealized prototypical patterns. The question now becomes how 
well the representation captures the phenomena of interest in the face of natural 
imagery. 
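
The predicted signatures can be reproduced on idealized synthetic patterns with the NumPy sketch below. It is an illustration under assumptions, not the authors' implementation: filters are built and applied in the frequency domain with circular convolution, the low-pass filtering after squaring is omitted, arrays are square and indexed [t, x], and the assignment of the two diagonal angles to R and L follows from treating sin(k(x - t)) as rightward motion.

```python
import numpy as np

# Assumed convention: theta is the direction of differentiation in the x-t plane,
# measured from the x axis. S responds to static structure, F to flicker.
THETAS = {"S": 0.0, "F": np.pi / 2, "L": np.pi / 4, "R": 3 * np.pi / 4}


def oriented_energy(image, theta, scale=3.0):
    """Quadrature energy from a sampled G2 filter and its numerical Hilbert partner."""
    n = image.shape[0]
    c = (np.arange(n) - n // 2) / scale
    t, x = np.meshgrid(c, c, indexing="ij")
    u = x * np.cos(theta) + t * np.sin(theta)
    g2 = (4.0 * u**2 - 2.0) * np.exp(-(x**2 + t**2))
    G = np.fft.fft2(np.fft.ifftshift(g2))
    f = np.fft.fftfreq(n)
    proj = f[None, :] * np.cos(theta) + f[:, None] * np.sin(theta)
    H = -1j * np.sign(proj) * G
    F = np.fft.fft2(image)
    return np.fft.ifft2(F * G).real ** 2 + np.fft.ifft2(F * H).real ** 2


def signature(image, eps):
    """Average normalized energy in each of the four components."""
    e = {name: oriented_energy(image, th) for name, th in THETAS.items()}
    total = sum(e.values()) + eps  # per-pixel sum over the four bands plus bias
    s, f, r, l = (e[name] / total for name in "SFRL")
    return {
        "|R-L|": float(np.abs(r - l).mean()),
        "R+L": float((r + l).mean()),
        "Sx": float(s.mean()),
        "Fx": float(f.mean()),
    }
```

With a static grating sin(kx) this gives a large Sx, a moderate R + L, and essentially zero |R - L| and Fx; with a translating grating sin(k(x - t)) both |R - L| and R + L are large and Sx and Fx are moderate, in agreement with the qualitative predictions described above for these idealized cases.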



Natural image examples A set of natural image sequences has been gathered 
that provides one real world example of each of the proposed classes of 
spatiotemporal structure, see Fig. 3. For the unstructured case the image sequence 
shows a featureless sky. For the static case the image sequence shows 
a motionless tree. (Note that for each of these first two cases, a single image 
was not simply duplicated multiple times to make the sequence, an actual video 
sequence of images was captured.) The third case, flicker, is given as a smooth 
surface of human skin as lightning flashes over time. Coherent motion is captu- 
red by a field of flowers that appear to move diagonally upward and to the right 
due to camera motion. Incoherent motion is captured by a sequence of images 
of overlapping legs in very complex motion (predominantly, but not entirely, 
horizontal motion). The last case, scintillation, is shown via a sequence of rain 
striking a puddle. All of the image sequences had horizontal, x, and vertical, 
y, length both equal to 64 while the temporal length (i.e., number of frames) 
was 40. All of the spatiotemporal image volumes were processed in an identical 
fashion by bringing them under the proposed oriented energy representation, 
as described in the previous section. This resulted in each original image being 
decomposed along the four dimensions, |R - L|, R + L, Sx and Fx. 

The results of the analysis are shown in Fig. 3. For each of the natural image 
examples a representative spatial slice shows the recovered energy along each of 
the dimensions, |R - L|, R + L, Sx and Fx. In each cell, the average (normalized) 
energy is shown for the entire spatiotemporal volume. (Note that due to the presence 
of the bias, $\epsilon$, the sum of R + L, Sx and Fx is not necessarily exactly 
unity.) In reviewing the results it is useful to compare the recovered distribution 
of energies with the predictions that are shown in Fig. 1. Beginning with 
the unstructured case, it is seen that all of the recovered energies are vanishingly 
small, exactly as predicted. The static case also follows the pattern predicted in 
Fig. 1. For this case it is interesting to note that the deviation from zero in the 
Fx component is due to some fluttering (i.e., scintillation) in the leaves of the 
tree. The flicker case also performs much as expected, with a bit more energy 
in the Fx component relative to the R + L component than anticipated. For the 
case of coherent motion the pattern of energy once again follows the prediction 
closely. Here it is important to note that the depicted motion is not strictly 
along the horizontal axis, rather it is diagonal. This accounts for the value of 
R + L being somewhat larger than |R - L| as the underlying L channel has a 
nonzero response. For the incoherent case, it is seen that while the general trend 
in the distribution of energies is consistent with predictions, the magnitude of 
R + L is not as large as expected. Examination of the data suggests that this 
is due to the Fx component taking on a larger relative value than expected due 
to the imposition of some flicker in the data as some bright objects come into 
and go out of view (e.g., bright props and boots that the people wear). Finally, 
the case of scintillation follows the predictions shown in Fig. 1 quite well. Taken 
on the whole, these initial empirical results support the ability of the proposed 
approach to make the kinds of distinctions that have been put forth. Clearly the 
utility of the representation depends on its ability to distinguish and identify 
populations of samples corresponding to the various semantic categories descri- 
bed. Demonstration of this ability will require a quantitative analysis of energy 
signatures across an appropriate collection of samples and is beyond the scope 
of this paper. 

2.2 Adding an Additional Spatial Dimension 

The approach that has been developed so far can be extended to include the ver- 
tical dimension, y, by augmenting the representation with a set of components 
that capture oriented structure in y-t image planes. The same set of oriented 
filters that were used previously are now applied to y-t planes, as before with 
the addition of a low-pass characteristic in the orthogonal spatial dimension,
now x. This will allow for (normalized) oriented energy to be computed in the
four directions: (i) oriented orthogonally to the spatial axis, y, (ii) oriented or- 
thogonally to the temporal axis, t and (iii,iv) along the two y-t diagonals. These 
energy computations are performed for an input image using the y-t counterparts 
of formulas o and (0). The resulting filtered images are then used to complete 
the representation in a way entirely analogous to that used for the horizontal 
case except with U and D (for up and down) replacing R and L. 

To illustrate these extensions, Fig. 4 shows the results of bringing the same set
of natural image examples that were used with the x-t analysis under the |U − D|,
U + D, Sy, Fy extensions to the representation. Here it is useful to refer to both
the a priori predictions of Fig. Has well as the previously presented x-t empirical 
results. By and large the results once again support the ability of the approach 
to distinguish the six qualitative classes that have been put forth. Note, however, 
that for the incoherent motion case the depicted movement is predominantly in
the x direction and the value of U + D is correspondingly relatively low.



2.3 Boundary Analysis 

As an example of how the proposed representation can be used for early seg- 
mentation of the input stream, we consider the detection of spatiotemporal bo- 
undaries. Differential operators matched to the juxtaposition of different kinds 
of spatiotemporal structure can be assembled from the primitive filter responses, 
R − L, R + L, Sx, Fx and their vertical (i.e., y-t) counterparts. To illustrate this
concept, consider the detection of (coherent) motion boundaries. Here, the intent 
is not to present a detailed discussion of motion boundary detection, which has 
been extensively treated elsewhere (see, for example [.Sf7imi yj l. but to use it as 
an example of the analysis of spatiotemporal differential structure in general. 

Coherent motion is most directly related to the opponent filtered images 
R — L and U — D. Correspondingly, the detection of coherent motion boundaries 
is based on the information in these images. As shown in Fig. 5, combining a
difference of Gaussians

$G(x, y, \sigma_1) - G(x, y, \sigma_2)$ (3)

operator (where $G(x, y, \sigma)$ is a Gaussian distribution with standard deviation
$\sigma$) with motion opponent signals yields a double opponency: The pointwise
opponency R − L is combined with a spatial opponency provided by the
difference of Gaussians and similarly for U − D. As in difference of Gaussian based
edge-detection, the zero-crossings in the convolution of (3) with R − L and
U − D are indicative of boundaries in these inputs. Final boundary detection
is based on the presence of a zero-crossing in either of the individual results
$(G(x, y, \sigma_1) - G(x, y, \sigma_2)) * (R - L)$ or $(G(x, y, \sigma_1) - G(x, y, \sigma_2)) * (U - D)$.

An example is shown in Fig. 5. Here, the difference of Gaussians (3) was
realized in terms of binomial approximations to low-pass filters with cut-off
frequencies at π/8 and π/16. A sequence of aerial imagery showing a tree canopy
with movement relative to undergrowth due to camera motion serves as input.
Due to the homogeneous texture of the vegetation, the boundary of the tree is
not visible in any one image from the sequence. Opponent motion images R — L 
and U — D were derived from this input and difference of Gaussian processing 
was applied to each of the motion opponent images. Finally, the zero-crossings 
in the results are marked. For purposes of display, the slope magnitude is cal- 
culated for the zero-crossings and summed between the two (double opponent) 
images to give an indication of the strength of the boundary signal. The result
accurately captures one’s visual impression upon viewing the corresponding
image sequence where the apparent boundary can be traced along the left side 
as an irregular contour, then along a diagonal and finally across the top where 
it has a pronounced divot. 
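
The processing chain just described can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the implementation used to produce the boundary figure: all function names are hypothetical, the binomial filter orders (4 and 8) are illustrative and are not claimed to match the π/8 and π/16 cut-offs, borders are zero-padded, and a zero-crossing is taken to be a sign change against the right or lower neighbour. The opponent images R − L and U − D are assumed to be available as 2-D arrays.

```python
import numpy as np

def binomial_kernel(order):
    """1-D binomial low-pass kernel with order + 1 taps and unit sum."""
    k = np.array([1.0])
    for _ in range(order):
        k = np.convolve(k, [0.5, 0.5])
    return k

def smooth(img, order):
    """Separable binomial smoothing: along rows, then along columns."""
    k = binomial_kernel(order)
    rows = np.apply_along_axis(lambda r: np.convolve(r, k, mode="same"), 1, img)
    return np.apply_along_axis(lambda c: np.convolve(c, k, mode="same"), 0, rows)

def dog(img, fine=4, coarse=8):
    """Difference of two low-pass versions (assumed filter orders)."""
    return smooth(img, fine) - smooth(img, coarse)

def zero_crossings(resp):
    """True where the response changes sign against the right or lower neighbour."""
    zc = np.zeros(resp.shape, dtype=bool)
    zc[:, :-1] |= resp[:, :-1] * resp[:, 1:] < 0
    zc[:-1, :] |= resp[:-1, :] * resp[1:, :] < 0
    return zc

def motion_boundaries(r_minus_l, u_minus_d):
    """Sum of slope magnitudes at the zero-crossings of both double-opponent images."""
    strength = np.zeros(r_minus_l.shape)
    for opponent in (r_minus_l, u_minus_d):
        resp = dog(opponent)
        gy, gx = np.gradient(resp)
        strength += np.hypot(gx, gy) * zero_crossings(resp)
    return strength
```

For an opponent image that is −1 on the left half and +1 on the right half, the returned strength is nonzero only along the vertical line separating the two halves.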

3 Discussion 

3.1 Implications 

The work that has been described in this paper builds on a considerable body 
of literature on spatiotemporal filtering. The main implication of the current 
effort is that the output of such filtering can be interpreted directly in terms 
of rather abstract information, i.e., the 6 proposed categories of spatiotemporal 
structure: structureless, static, flicker, coherent motion, incoherent motion and 
scintillation. Based on the analysis presented, not all of these classes are equally 
discriminable under the proposed representation. The signatures for the classes 
structureless, static, flicker and coherent motion are quite distinct, but those for 
incoherent motion and scintillation (while distinct from the other four) differ 
from each other only in the amount of energy expected in the summed energies 
R + L and U + D. This state of affairs suggests that these last two categories
might be best distinguished from each other in relative comparisons, while all 
other distinctions might be accomplished in a more independent and absolute 
fashion. This difference has implications for how the signatures can be used: The 
stronger form of distinctness supports categorical decisions about signal content 
across imaging situations; because it depends on a metric comparison, the weaker 
form probably does not. 

Operations have been described at a single spatiotemporal scale; however, 
the proposed representation is a natural candidate for multiscale extensions uni 
EIJ. Indeed, such extensions might support finer distinctions between categories 
of spatiotemporal structure as characteristic signatures could be manifest across 
scale. Two kinds of extension can be distinguished. The first is concerned with 
varying the region of (spatiotemporal) integration that is applied to the oriented 
energy measures. The second type of multiscale extension concerns the frequency 
tuning of the underlying oriented filters. A systematic extension in this regard 
would operate at a number of spatial frequency bands and, for each of these 
bands, perform the analysis for a number of temporal frequency bands. It would 
thereby be possible to tile the frequency domain and correspondingly characte- 
rize the local orientation structure of an input spatiotemporal volume. These two 
extensions serve distinct purposes that are perhaps best understood with respect 
to a simple example. Consider a typically complex outdoor scene containing a
tree blowing in a gusty wind and illuminated by a sunny sky with a few drifting 
clouds in it. As the tree branches sway back and forth, the corresponding image 
motion will be locally and temporarily coherent. However, over longer periods 
of time or over larger areas it will be incoherent or oscillatory. Thus, the cha- 
racterization of the spatiotemporal structure will shift from one category to the 
other as the region of analysis is extended. Now consider the effect of a cloud 
shadow passing across the tree. At a fine spatial scale (i.e. for a high spatial fre- 
quency underlying filter) it will look like an illumination variation thus having a 
component in the “flicker” category. At the scale of the shadow itself (i.e. at low 
spatial frequency) it will look like coherent motion as the cloud passes over. The 
pattern of spatiotemporal signatures taken as a function of scale thus captures 
both the structural complexities of the dynamic scene and the quasitransparency 
of complex illumination. These two types of scaling behavior are complementary
and taken in tandem serve to enrich the descriptive vocabulary of the approach. 

In contrast to the main message of this paper regarding the abstraction of 
spatiotemporal information to the level of qualitative descriptors, the details of 
the particular filtering architecture that have been employed are less important. 
A variety of alternatives could be employed, including oriented Gabor (e.g., m) 
and lognormal (e.g., HH) filters. Similarly, one might be concerned with issues 
of causality and use oriented spatiotemporal filters that respect time’s arrow P] 
EE^- Also, one might consider a more uniform sampling of orientation in x-y- 
t-space, rather than relying on x-t and y-t planes. Nevertheless, it is interesting 
that the fairly simple filters that were employed in the current effort have worked 
reasonably well for a variety of natural image examples. 

The type of qualitative analysis described here seems particularly suited to 
processing in biological vision systems because of the apparently hierarchical 
nature of biological computation and the importance of such factors as attention. 
It is interesting therefore to note aspects of biological processing that relate to 
the current approach. With respect to fineness of sampling in the spatiotemporal 
domain, it appears that humans employ only about 2 to 3 temporal bands, while 
making use of 6 or more spatial bands WMm- Also, there is evidence that 
biological systems combine motion tuned channels in an opponent fashion PH, 
as is done in the current work. Regarding the degree to which filter responses are 
spatially integrated (i.e., as part of computing aggregate properties of a region) 
biological systems seem to be rather conservative: Physiological recordings of 
visual cortex complex cells indicate integration regions on the order of 2 to 
5 cycles of the peak frequency m, suggesting a preference for preservation 
of spatial detail over large area summation. It also is interesting to note that 
human contrast sensitivity is on the order of 1 % S| . an amount that has 
proven useful analogously in the current work as a choice for the bias in the 
process of energy normalization Q. With regard to border analysis, part of a 
purported mechanism for the detection of relative movement in the fly makes 
use of spatially antagonistic motion comparisons PH, in a fashion suggestive of 
the approach taken in the current paper. 

Based on the ideas of this paper, a number of applications can be envisioned
falling into two broad areas of potential impact. The first type of application 
concerns front end processing for real-time vision tasks. In this capacity, it could 
provide an initial organization, thereby focusing subsequent processing on por- 
tions of the data most relevant to critical concerns (e.g., distinguishing static, 
dynamic and low information regions of the scene). The second type of appli- 
cation concerns issues in the organization and access of video sequences. Here, 
the proposed representation could be used to define feature vectors that capture 
volumetric properties of spatiotemporal information (e.g., space-time texture) 
as an aid to the design and indexing of video databases. More generally, the 
proposed approach would be appropriate to a variety of tasks that could benefit 
from the early organization of spatiotemporal image data. 



3.2 Summary 

This paper has presented an approach to representing and analyzing spatiotemporal
data in support of making qualitative yet semantically meaningful distinctions.
In this regard, it has been suggested how to ask and answer a number
of simple, yet significant questions, such as: Which spatiotemporal regions are
stationary? Which regions are moving in a coherent (or incoherent) fashion?
How much of the variance in the spatiotemporal data is due to overall changes
in intensity? Where is the spatiotemporal structure isotropic and indicative of
scintillation? Where is the data stream simply lacking in sufficient structure to 
support further inference? Also indicated has been an approach to issues re- 
garding the analysis of spatiotemporal boundaries. Further, all of these matters 
have been embodied in a unified oriented energy representation. A variety of 
empirical results using natural image data suggest that the approach may have 
the representational power to support the desired distinctions. On the basis of 
these results, it is conjectured that the developed representation and analysis 
can subserve a variety of vision-based tasks and applications. More generally, 
the approach provides an integrated framework for dealing with spatiotempo- 
ral data in terms of its abstract information content at the earliest stages of 
processing. 

Fig. 3. Results of Testing the Proposed Representation on Natural Imagery. For each
of the proposed primitive classes, a sequence of images that displays the associated 
phenomena was selected. Top row, left to right: featureless sky, a motionless tree, 
lightning flashing on (motionless) skin, a field of flowers in diagonal motion due to 
camera movement, legs of multiple cheerleaders in overlapping motion and rain striking 
a puddle. Each sequence has x, y, t dimensions of 64, 64, 40, respectively. The second 
row shows corresponding x-t slices. The next four rows show the recovered energies in
each of four components of the representation. Each cell shows a representative spatial,
i.e., x, y, slice of the processed data as well as the average value for the energy across
the entire spatiotemporal volume. Overall, the results are in accord with the predictions 
of Eig. Q 

Fig. 4. Results of Testing the Proposed Representation on Natural Imagery. The input
imagery and general format of the display are the same as in Fig. 3. Four additional
components of the representation are now shown to incorporate information in the y
spatial dimension. The overall pattern of results is consistent with predictions.




Fig. 5. Motion Boundary Detection. Left to right: A schematic of a double opponent 
motion operator for motion boundary detection. An aerial image of a tree canopy 
moving against undergrowth with relative motion due to camera movement. The hand 
marked outline of the motion boundary. The magnitude of the boundary signal. The 
result accurately localizes the edge of the tree against the background. 

Abstract. Extending a differential total least squares method for range
flow estimation, we present an iterative regularisation approach to compute
dense range flow fields. We demonstrate how this algorithm can
be used to detect motion discontinuities. This can be used to segment
the data into independently moving regions. The different types of
aperture problem encountered are discussed. Our regularisation scheme 
then takes the various types of flow vectors and combines them into a 
smooth flow field within the previously segmented regions. A quantitative 
performance analysis is presented on both synthetic and real data. The 
proposed algorithm is also applied to range data from castor oil plants 
obtained with the Biris laser range sensor to study the 3-D motion of 
plant leaves. 

Keywords: range flow, range image sequences, regularisation, shape,
visual motion. 



1 Introduction 



We are concerned with the estimation of local three-dimensional velocity from 
a sequence of depth maps. Previously we introduced a total least squares (TLS) 
algorithm for the estimation of this so-called range flow. It is shown that
the result of this TLS algorithm can be used to detect boundaries between 
independently moving regions, which enables a segmentation. However, within 
these regions the computed flow fields are not generally dense. To amend this 
we present an iterative regularisation method to compute dense full flow fields 
using the information available from the TLS estimation. 

Most previous work on range sequence analysis focuses on the estimation of 
the 3D motion parameters of either a moving sensor or an object 141516] . Such 
approaches implicitly assume global rigidity. In contrast we are dealing with only 
locally rigid objects moving in an environment observed by a stationary sensor. 
As with optical flow calculation we initially assume that the flow field can be 
approximated as being constant within a small local aperture m In a second 
processing step this is replaced by requiring the flow field to be smooth. The 
work presented here is related to previously reported model based range flow 
estimation on non-rigid surfaces. The 3D range flow can also be recovered
from optical flow if other surface properties such as depth or correspondences
are available. Some other work includes 2D range flow obtainable from a
radial range sensor and car tracking in range image sequences.

The underlying constraint equation is introduced in Sect. 2. Then Sect. 3
recapitulates the TLS estimation technique; in particular it is described how
sensible parameters can be estimated even if not enough constraints are available,
see Sect. 3.2. This is a generalisation of the known normal flow estimation in
optical flow algorithms. It is also demonstrated how boundaries in the motion
field between differently moving regions can be detected. Section 4 then shows
how a dense parameter field can be obtained exploiting the previously calculated
information. In Sect. 5 we proceed towards a quantitative performance analysis,
where we introduce appropriate error measures for range flow. The method's
potential is exploited on both synthetic (Sect. 5.2) and real data (Sect. 5.3).
Results of applying our algorithm to sequences of range scans of plant leaves are 
given in Sect. El 

The work reported here was performed with data gathered by a Biris laser 
range sensor m- The algorithm introduced could, however, be equally well used 
on dense depth maps obtained from structured lighting, stereo or motion and 
structure techniques. 

2 Constraint Equation 

Depth is taken as a function of space and time, $Z = Z(X, Y, T)$. From the total
derivative with respect to time we derive the range flow motion constraint
equation

$Z_X \dot{X} + Z_Y \dot{Y} - \dot{Z} + Z_T = 0$ . (1)

Here partial derivatives are denoted by subscripts and time derivatives by using
a dot. We call the 3D motion vector range flow $f$ and introduce the following
abbreviation $f = [U\ V\ W]^T = [\dot{X}\ \dot{Y}\ -\dot{Z}]^T$. The range flow motion constraint
(1) then becomes

$Z_X U + Z_Y V + W + Z_T = [Z_X\ Z_Y\ 1\ Z_T]\,[U\ V\ W\ 1]^T = 0$ . (2)

As this gives only one constraint equation in three unknowns we need to make
further assumptions; this is the aperture problem revisited.

Equation (2) describes a plane in velocity space. If there are three mutually
independent planes in a local neighbourhood we can compute full flow under the 
assumption of locally constant flow fields. Obviously this could easily be extended 
to incorporate linear flow variations. If there is only one repeated constraint in 
the entire considered neighbourhood only the normal flow can be recovered. As 
this occurs on planar surfaces we call this plane flow. When two planes meet in
the aperture we get two constraint classes; in this case it is possible to determine
all but the part of the flow in the direction of the intersection. The flow with
minimal norm perpendicular to the intersecting line will be called line flow.
The following section describes how we can compute the described flow types 
using a total least squares (TLS) estimator. 

3 Total Least Squares Estimation 

The TLS solution presented here is an extension of the structure tensor algorithm
for optical flow estimation. The method may also be viewed as a special
case of a more general technique for parameter estimation in image sequences.

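Assuming constant flow in a region containing $n$ pixels we have $n$ equations
(2). With $d = [Z_X\ Z_Y\ 1\ Z_T]^T$, $u = [U\ V\ W\ 1]^T$ and the data matrix
$D = [d_1 \dots d_n]^T$, the flow estimation in a total least squares sense can be formulated
as

$\|D u\|^2 \to \min$ subject to $u^T u = 1$ . (3)

The solution is given by the eigenvector $e_4$, corresponding to the smallest
eigenvalue $\lambda_4$ of the generalised structure tensor

$F = \begin{bmatrix} \langle Z_X Z_X \rangle & \langle Z_X Z_Y \rangle & \langle Z_X \rangle & \langle Z_X Z_T \rangle \\ \langle Z_Y Z_X \rangle & \langle Z_Y Z_Y \rangle & \langle Z_Y \rangle & \langle Z_Y Z_T \rangle \\ \langle Z_X \rangle & \langle Z_Y \rangle & \langle 1 \rangle & \langle Z_T \rangle \\ \langle Z_T Z_X \rangle & \langle Z_T Z_Y \rangle & \langle Z_T \rangle & \langle Z_T Z_T \rangle \end{bmatrix}$ . (4)

Here $\langle \cdot \rangle$ denotes local averaging using a box or binomial filter. The desired
range flow is then given by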



$f_f = \frac{1}{e_{44}} \begin{bmatrix} e_{14} \\ e_{24} \\ e_{34} \end{bmatrix}$ . (5)



As $F$ is real and symmetric the eigenvalues and eigenvectors can easily be computed
using Jacobi rotations. In order to save execution time we only compute
range flow where the trace of the tensor exceeds a threshold $\tau_1$. This eliminates
regions with insufficient magnitude of the gradient. The regularisation step
described in Sect. 4 subsequently closes these holes.
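
As an illustration of the estimation step, the following minimal Python sketch computes the full flow for a single neighbourhood. It is a sketch under stated assumptions rather than the authors' implementation: the function name is hypothetical, the local averaging is a plain box average over the supplied samples, and NumPy's symmetric eigensolver `numpy.linalg.eigh` is swapped in for the Jacobi rotations mentioned in the text.

```python
import numpy as np

def tls_range_flow(Zx, Zy, Zt):
    """TLS full flow from the depth derivatives sampled in one neighbourhood.

    Returns the flow [U, V, W] and the eigenvalues of the structure tensor
    sorted in descending order (lambda_1 ... lambda_4).
    """
    Zx, Zy, Zt = (np.asarray(a, dtype=float).ravel() for a in (Zx, Zy, Zt))
    # Data matrix D with rows d_i^T = [Z_X, Z_Y, 1, Z_T].
    D = np.stack([Zx, Zy, np.ones_like(Zx), Zt], axis=1)
    # Structure tensor with box averaging (an assumption of this sketch).
    F = D.T @ D / len(D)
    lam, E = np.linalg.eigh(F)      # eigenvalues in ascending order
    e4 = E[:, 0]                    # eigenvector of the smallest eigenvalue
    return e4[:3] / e4[3], lam[::-1]
```

When the derivatives satisfy the constraint $Z_X U + Z_Y V + W + Z_T = 0$ exactly for one flow vector and contain at least three independent constraints, the smallest eigenvalue vanishes up to rounding and that flow vector is recovered.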



3.1 Detecting Motion Discontinuities 

In the above we are really fitting a local constant flow model to the data. The
smallest eigenvalue $\lambda_4$ directly measures the quality of this fit. In particular at
motion discontinuities the data can not be described by a single flow and the
fit fails. This is also the case for pure noise without any coherent motion. To
quantify this we introduce a confidence measure

$\omega = \begin{cases} 0 & \text{if } \lambda_4 > \tau_2 \text{ or } \operatorname{tr}(F) < \tau_1 \\ \left( \frac{\tau_2 - \lambda_4}{\tau_2 + \lambda_4} \right)^2 & \text{else} \end{cases}$ . (6)

Figure 1 shows the obtained confidence measure for a synthetic sequence of depth
maps. Clearly motion discontinuities and pure noise can be identified. Also the
estimated full flow is very close to the correct flow; however, this full flow can
not be computed everywhere regardless of $\omega$. The next section explains why and
how to deal with such situations.

Fig. 1. Using the confidence measure to detect motion discontinuities: a synthetic
depth map, the lower right quarter contains random noise without coherent motion, b
X, Y-component of the correct flow field, c confidence measure ($\tau_2 = 0.1$) and d TLS
full flow.
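
The confidence measure is simple enough to state directly in code. The sketch below assumes the form $\omega = ((\tau_2 - \lambda_4)/(\tau_2 + \lambda_4))^2$, set to zero when $\lambda_4 > \tau_2$ or when the trace of the tensor falls below $\tau_1$; the function and argument names are hypothetical.

```python
def confidence(lambda4, trace, tau1, tau2):
    """Confidence of the TLS fit from the smallest eigenvalue and the tensor trace."""
    if lambda4 > tau2 or trace < tau1:
        return 0.0  # poor fit (discontinuity, noise) or too little structure
    return ((tau2 - lambda4) / (tau2 + lambda4)) ** 2
```

A perfect fit ($\lambda_4 = 0$) with sufficient gradient energy gives a confidence of one, and the value falls smoothly to zero as $\lambda_4$ approaches $\tau_2$.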



3.2 Normal Flows 



Let the eigenvalues of $F$ be sorted: $\lambda_1 \geq \lambda_2 \geq \lambda_3 \geq \lambda_4$. Thus if $\lambda_3 \approx \lambda_4$ no
unique solution can be found. More generally, any vector in the nullspace of
$F$ is a possible solution. In this case it is desirable to use the solution with
minimal norm. Towards this end the possible solutions are expressed as linear
combinations of the relevant eigenvectors and that with minimal Euclidean norm
is chosen, see the Appendix for details.

On planar structures all equations (2) are essentially the same. Only the
largest eigenvalue is significantly different ($> \tau_2$) from zero. The so called plane
flow can then be found from the corresponding eigenvector $e_1 = [e_{11}\ e_{21}\ e_{31}\ e_{41}]^T$
as follows

$f_p = -\frac{e_{41}}{e_{11}^2 + e_{21}^2 + e_{31}^2} \begin{bmatrix} e_{11} \\ e_{21} \\ e_{31} \end{bmatrix}$ . (7)



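Linear structures exhibit two types of constraints within the considered aperture;
the minimum norm solution (line flow) is found from the eigenvectors $e_1$, $e_2$:

$f_l = -\frac{1}{1 - e_{41}^2 - e_{42}^2} \left( e_{41} \begin{bmatrix} e_{11} \\ e_{21} \\ e_{31} \end{bmatrix} + e_{42} \begin{bmatrix} e_{12} \\ e_{22} \\ e_{32} \end{bmatrix} \right)$ . (8)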
Figure 2 shows an example of the various flow types.

Fig. 2. Example flow types: a synthetic depth map, b rendered. X-Y components of
the estimated flow fields: c full flow, e line flow and g plane flow and X-Z components
of the estimated flow fields: d full flow, f line flow and h plane flow.
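
The plane and line flows can be computed with one small routine. The following Python sketch is an illustration under stated assumptions, not the authors' code: the function name is hypothetical, the number $p$ of significant eigenvalues is passed in by the caller rather than decided by thresholding, and the minimum norm solution is written as $f = -\sum_{i \le p} e_{4i}\,[e_{1i}\ e_{2i}\ e_{3i}]^T / (1 - \sum_{i \le p} e_{4i}^2)$, which follows from requiring $[f^T\ 1]^T$ to be orthogonal to the $p$ significant eigenvectors.

```python
import numpy as np

def min_norm_flow(F, p):
    """Minimum-norm flow from the p eigenvectors of F with the largest eigenvalues.

    p = 1 gives the plane flow, p = 2 the line flow and p = 3 the full flow.
    """
    lam, E = np.linalg.eigh(F)       # ascending eigenvalues
    E = E[:, ::-1][:, :p]            # columns e_1 ... e_p
    reduced = E[:3, :]               # first three components of each eigenvector
    last = E[3, :]                   # fourth components e_41 ... e_4p
    return -(reduced @ last) / (1.0 - last @ last)
```

For a single repeated constraint $d$ the result equals the normal flow $-d_4\,[d_1\ d_2\ d_3]^T / (d_1^2 + d_2^2 + d_3^2)$, and for two constraint classes the result satisfies both constraints while having no component along the line of intersection.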



4 Flow Regularisation 



We now introduce a simple iterative regularisation algorithm that computes 
smoothly varying flow fields in some previously segmented area A. Segmentation 
of the data into regions of different motions is best accomplished by means of 
the previously described threshold on the lowest eigenvalue of F, see Sect. 3.1.
However, if additional knowledge about the scene is available other segmentation 
schemes may be employed. As we are given depth data such a segmentation is 
often feasible. 

We seek to estimate a dense and smooth flow field $v = [U\ V\ W]^T$. In places
where flow estimations from the above TLS algorithm exist we denote them $f$,
computed from (5), (7) or (8) as appropriate. As we are now working in 3 dimensions
and from the structure of the TLS solution given above we can use the reduced
eigenvectors as a, not necessarily orthogonal, basis for the desired solution

$b_i = \frac{1}{\sqrt{\sum_{k=1}^{3} e_{ki}^2}} \begin{bmatrix} e_{1i} \\ e_{2i} \\ e_{3i} \end{bmatrix}$ , $i = 1, 2, 3$ . (9)

Using this notation we define a projection matrix which projects onto the subspace
that was determined by our TLS algorithm

$P = B_p B_p^T$ where $\lambda_p > \lambda_{p+1} \approx \dots \approx \lambda_4 \approx 0$ , (10)

$B_p = [b_1 \dots b_p]$ . (11)

Each estimated flow vector $f_{f,p,l}$ constrains the solution within this subspace.
We therefore require the regularised solution to be close in a least squares sense

$(P v - f)^2 \to \min$ . (12)

At locations where no solution has been computed obviously no such data term
exists. To ensure smoothly varying parameters we use a smoothness term

$\sum_{i=1}^{3} (\nabla v_i)^2 \to \min$ . (13)

Obviously the use of this simple membrane model is only justified because we
have already segmented the data into differently moving objects. If no such
segmentation were available more elaborate schemes would have to be considered.
The above smoothness term usually considers only spatial neighbourhoods
($\nabla = [\partial_x, \partial_y]^T$); however, this is easily extended to enforce temporal
smoothness as well ($\nabla = [\partial_x, \partial_y, \partial_t]^T$).

Combining the data (12) and smoothness (13) terms in the considered area
$A$ yields the following minimisation problem

$\int_A \underbrace{\left\{ \omega (P v - f)^2 + \alpha \sum_{i=1}^{3} (\nabla v_i)^2 \right\}}_{h(v)} \, dr \to \min$ , (14)

where $\omega$, given by equation (6), captures the confidence of the TLS solution.
The overall smoothness can be regulated by the constant $\alpha$. The minimum of
(14) is reached when the Euler-Lagrange equations are satisfied



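$\frac{\partial h}{\partial v_i} - \frac{d}{dx} \frac{\partial h}{\partial (v_i)_x} - \frac{d}{dy} \frac{\partial h}{\partial (v_i)_y} = 0$ . (15)

If an extension in the temporal domain is anticipated another term $-\frac{d}{dt} \frac{\partial h}{\partial (v_i)_t}$
has to be added. Subscripts $x, y, t$ denote partial differentiation. Using vector
notation we write the Euler-Lagrange equations as follows:

$\frac{\partial h}{\partial v} - \frac{d}{dx} \frac{\partial h}{\partial (v_x)} - \frac{d}{dy} \frac{\partial h}{\partial (v_y)} = 0$ . (16)

Computing the derivatives yields

$2 \omega P (P v - f) - 2 \alpha \left( \frac{d}{dx} v_x + \frac{d}{dy} v_y \right) = 0$ . (17)

Introducing the Laplacian $\Delta v = v_{xx} + v_{yy}$ we get

$\omega P v - \omega P f - \alpha \Delta v = 0$ , (18)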



Regularised Range Flow 791 

where the idempotence of the projection matrix $P P = P$ is used. The Laplacian
can be approximated as $\Delta v \approx \bar{v} - v$, where $\bar{v}$ denotes a local average. In
principle this average has to be calculated without taking the central pixel into
consideration. Using this approximation we arrive at

$(\omega P + \alpha \mathbb{1})\, v = \alpha \bar{v} + \omega P f$ . (19)

This enables an iterative solution to the minimisation problem. We introduce
$A = \omega P + \alpha \mathbb{1}$ and get an update from the solution at step $k$

$v^{k+1} = \alpha A^{-1} \bar{v}^k + \omega A^{-1} P f$ . (20)

Initialisation is done as $v^0 = 0$. The matrix $A^{-1}$ only has to be computed
once; existence of the inverse is guaranteed by the Sherman-Morrison-Woodbury
formula, see the Appendix.
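
The iteration can be sketched compactly with NumPy. This is a minimal sketch under stated assumptions, not the authors' implementation: it uses the update $v^{k+1} = \alpha A^{-1} \bar{v}^k + \omega A^{-1} P f$ with $A = \omega P + \alpha \mathbb{1}$, the function names are hypothetical, the local average is taken over the four nearest neighbours with replicated borders, and the projection matrices, TLS flows and confidences are assumed to be supplied per pixel (with $\omega = 0$ wherever no TLS estimate exists).

```python
import numpy as np

def neighbour_mean(v):
    """Mean of the four nearest neighbours, excluding the central pixel."""
    p = np.pad(v, ((1, 1), (1, 1), (0, 0)), mode="edge")
    return (p[:-2, 1:-1] + p[2:, 1:-1] + p[1:-1, :-2] + p[1:-1, 2:]) / 4.0

def regularise(P, f, omega, alpha=10.0, iterations=100):
    """Iterative regularisation of a flow field inside one segmented region.

    P:     (H, W, 3, 3) projection matrices from the TLS step
    f:     (H, W, 3)    TLS flow estimates (ignored where omega is zero)
    omega: (H, W)       confidence of the TLS estimates
    """
    A = omega[..., None, None] * P + alpha * np.eye(3)
    A_inv = np.linalg.inv(A)                        # computed only once
    data = omega[..., None] * np.einsum("...ij,...jk,...k->...i", A_inv, P, f)
    v = np.zeros_like(f, dtype=float)               # initialisation v^0 = 0
    for _ in range(iterations):
        v = alpha * np.einsum("...ij,...j->...i", A_inv, neighbour_mean(v)) + data
    return v
```

With a constant flow given on part of the grid and no data elsewhere, the iteration fills in the missing area with the same constant flow; the number of iterations needed grows with the size of the area to be filled, in line with the remarks on convergence in Sect. 5.2.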

4.1 Direct Regularisation 

Instead of performing a TLS analysis first one might want to directly try to
find the flow field by imposing the smoothness constraint, in analogy to the well
known optical flow algorithm by Horn and Schunck. As mentioned before this
simple smoothness term is not generally advisable, mainly because problematic
locations ($\lambda_4 > 0$) are equally taken into account. In particular it smoothes
across motion discontinuities. On the other hand this regularisation works very
well when a segmentation, if at all necessary, can be achieved otherwise.

However, if the TLS algorithm is used for segmenting the data, it makes
sense to use the thus available information. The scheme described in Sect. 4
does usually converge much faster than direct regularisation. Yet, it is sometimes
advisable to use the direct regularisation as a final processing step. The already
dense and smooth flow field is used to initialise this step. Especially on real
data, where the TLS estimate occasionally produces outliers, this post-processing
improves the result, see Sect. 5.3.

Therefore we briefly discuss how such a direct regularisation can be applied
to sequences of depth maps. The minimisation in this case reads

$\int_A \underbrace{\left\{ (d^T u)^2 + \alpha \sum_{i=1}^{3} (\nabla v_i)^2 \right\}}_{h(v)} \, dr \to \min$ . (21)

Here we only work on the first $n - 1$ components of $u = [v^T\ 1]^T$. Looking at the
Euler-Lagrange equations (16) we get

$2 d' (d'^T v + d_4) - 2 \alpha \Delta v = 0$ where $d = [d'^T\ d_4]^T$ . (22)

Again approximating the Laplacian as the difference $\Delta v \approx \bar{v} - v$ this can be rewritten
as

$(d' d'^T) v - \alpha \bar{v} + \alpha v = -d' d_4$ (23)

$\underbrace{(\alpha \mathbb{1} + d' d'^T)}_{A_1} v = \alpha \bar{v} - d' d_4$ . (24)

Table 1. Results on synthetic data using $\alpha = 10$, $\tau_1 = 15$ and $\tau_2 = 0.1$.

An iterative solution is found using the following update

$v^{k+1} = \alpha A_1^{-1} \bar{v}^k - d_4 A_1^{-1} d'$ . (25)

Initialisation can be done by the direct minimum norm solution to $d^T u = 0$
given by:

$v^0 = -\frac{d_4}{d_1^2 + d_2^2 + d_3^2} \begin{bmatrix} d_1 \\ d_2 \\ d_3 \end{bmatrix} = -\frac{Z_T}{Z_X^2 + Z_Y^2 + 1} \begin{bmatrix} Z_X \\ Z_Y \\ 1 \end{bmatrix}$ . (26)

The existence of the inverse of $A_1$ is guaranteed, see the Appendix.
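
For a single pixel the quantities of the direct regularisation are easy to write down explicitly. The sketch below is illustrative only and rests on stated assumptions: the function names are hypothetical, the constraint is taken in the form $d'^T v + d_4 = 0$ with $d = [Z_X\ Z_Y\ 1\ Z_T]^T$, and the inverse of $A_1 = \alpha \mathbb{1} + d' d'^T$ is written in closed form using the Sherman-Morrison formula, $A_1^{-1} = (\mathbb{1} - d' d'^T / (\alpha + d'^T d')) / \alpha$, which exists for every $\alpha > 0$.

```python
import numpy as np

def a1_inverse(d_prime, alpha):
    """Closed-form inverse of A1 = alpha * I + d' d'^T (Sherman-Morrison)."""
    d_prime = np.asarray(d_prime, dtype=float)
    outer = np.outer(d_prime, d_prime)
    return (np.eye(3) - outer / (alpha + d_prime @ d_prime)) / alpha

def min_norm_init(d):
    """Minimum-norm flow satisfying d'^T v + d4 = 0, used as initialisation."""
    d = np.asarray(d, dtype=float)
    return -d[3] * d[:3] / (d[:3] @ d[:3])

def direct_update(v_bar, d, alpha):
    """One update at a single pixel: solve (alpha * I + d' d'^T) v = alpha * v_bar - d4 * d'."""
    d = np.asarray(d, dtype=float)
    return a1_inverse(d[:3], alpha) @ (alpha * np.asarray(v_bar, dtype=float) - d[3] * d[:3])
```

A useful sanity check on the signs is that a local average which already satisfies the constraint is left unchanged by the update.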

5 Quantitative Performance Analysis 

We now give a quantitative analysis of the proposed algorithm. Even though our 
algorithm can be used with any kind of differential depth maps we focus on depth 
maps taken with a laser range finder. In particular we are concerned with a Biris 
sensor PI. First we introduce the error measures used. Due to experimental 
limitations the available real data with known ground truth only contains pure 
translational movements. Thus we also look at one synthetic sequence with a 
motion field that exhibits some divergence and rotation. 



5.1 Error Measures 



In order to quantify the obtained results three error measures are used. Let the
correct range flow be called $f_c$ and the estimated flow $f_e$. The first error measure
describes the relative error in magnitude

$E_r = \frac{\left| \|f_c\| - \|f_e\| \right|}{\|f_c\|} \cdot 100\ [\%]$ . (27)

The deviation in the direction of the flow is captured by the directional error

$E_d = \arccos\left( \frac{f_c \cdot f_e}{\|f_c\| \, \|f_e\|} \right)$ . (28)

Fig. 3. Synthetic sequence generated from a real depth map. a rendered depth map,
b correct flow and c estimated flow field ($\alpha = 10$, $\tau_1 = 25$ and $\tau_2 = 0.01$).



Even though both $E_r$ and $E_d$ are available at each location we only report
their average values in the following. It is also interesting to see if the flow is
consistently over- or underestimated. This is measured by a bias error measure

$E_b = \frac{1}{N} \sum \frac{\|f_e\| - \|f_c\|}{\|f_c\|} \cdot 100\ [\%]$ , (29)

where the summation is carried out over the $N$ locations of the entire considered
region. This measure will be negative if the estimated flow magnitude is systematically
smaller than the correct magnitude.
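
The three error measures can be written out as follows. This is a hedged sketch with hypothetical function names; it assumes that the relative error uses the absolute difference of the magnitudes, that the directional error is reported in degrees, and that the bias is the mean signed relative difference expressed in percent.

```python
import math

def norm(f):
    """Euclidean norm of a flow vector."""
    return math.sqrt(sum(c * c for c in f))

def relative_error(fc, fe):
    """Relative magnitude error in percent between correct and estimated flow."""
    return abs(norm(fc) - norm(fe)) / norm(fc) * 100.0

def directional_error(fc, fe):
    """Angle in degrees between correct and estimated flow."""
    cosine = sum(a * b for a, b in zip(fc, fe)) / (norm(fc) * norm(fe))
    return math.degrees(math.acos(max(-1.0, min(1.0, cosine))))

def bias_error(correct, estimated):
    """Mean signed relative magnitude difference in percent over a region."""
    terms = [(norm(fe) - norm(fc)) / norm(fc) for fc, fe in zip(correct, estimated)]
    return 100.0 * sum(terms) / len(terms)
```

The clamping of the cosine guards against arguments marginally outside [−1, 1] caused by rounding when the two vectors are parallel.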



5.2 Synthetic Test Data 

The performance of the TLS algorithm, described in Sect. 3, has previously been
analysed on synthetic data. Here we simply repeat the results that for low
noise levels of less than 2% (laser range data is typically a factor 10 less noisy)
all flow types (full, line, plane) can be estimated with less than 5% relative error
$E_r$ and less than 5° error $E_d$ in the velocity range of 0.5 to 3 units per frame.
Here unit stands for the mean distance between adjacent data points, typically
≈ 0.3 mm.

The regularisation algorithm produces excellent results on pure synthetic 
data. Instead of giving numerous such results we simply state that for the se- 
quences shown in figures Eand Qwe achieve the results given in table D It can 
be seen that when starting with a relatively dense flow field as in Fig. Q] the use 
of 100 iterations provides good results. The remaining error here is mainly due 
to small mistakes in the segmentation into different regions. On data like that of 
Fig.O where we have large areas to be filled in, far more iterations are necessary. 
Convergence can be accelerated by starting with an interpolated full flow field 
instead of a zero flow field or by employing a hierarchical method m- 

As we are unable to make real test data with other than translational motion,
we took the depth map from one scan and warped the data¹ with a known flow
field. Figure 3 shows the depth map taken from a crumpled sheet of paper. It
can be seen that the estimated flow is very close to the correct flow. In numbers
we get E_r = 2.1 ± 1.6%, E_d = 2.3 ± 0.8° and E_b = 1.9% after 100 iterations.
From the last number we see that in this particular example the estimated flow
is systematically larger than the correct flow; this can be attributed to the very
small velocities present in this case.

¹ Using bicubic interpolation.

Fig. 4. Real test data: a laser scanner and positioners and b depth maps of the used
test objects. Object 1 and 2 are freshly cut castor bean leaves, object 3 a toy tiger and
object 4 a sheet of crumpled newspaper.



5.3 Real Test Data 

In order to get real test data we placed some test objects on a set of linear
positioners, see Fig. 4. The positioners allow for translations along all three axes.
As the objects are placed on a flat surface we segmented them prior to any
computation. There is no motion discontinuity in this case and without
segmenting we would have to use the background as well. Due to the lack of structure
there this would make convergence extremely slow. Table 2 gives some results.
Here first the indirect regularisation (Sect.^) is employed for 300 iterations with
α = 10, τ1 = 15 and τ2 = 0.01. Then the direct regularisation (Sect.^^) is used
for another 200 iterations with α = 5. This post-processing typically improves
the result (E_r) by about 1-2%.

Given the fact that we are dealing with real data these results are quite 
encouraging. The average distance between two data points is 0.46mm in X- 
direction and 0.35mm in Y-direction, which shows that we are able to estimate 
sub-pixel displacements. One has to keep in mind that even slight misalignments 
of the positioner and the laser scanner introduce systematic errors.

Table 2. Results on real data.

object   correct flow [mm]         E_r [%]       E_d [°]       E_b [%]
1        [0.0 0.0 0.48]^T          3.0 ± 3.2     4.5 ± 3.0      1.5
1        [0.32 -0.38 0.32]^T       9.2 ± 5.9     11.2 ± 6.0     1.3
2        [-0.32 0.0 0.0]^T         8.0 ± 11.6    7.5 ± 4.6      2.2
2        [-0.64 0.0 0.0]^T         6.1 ± 8.5     6.0 ± 3.8     -0.3
3        [-0.16 -0.19 -0.32]^T     3.5 ± 2.5     3.2 ± 2.4     -2.7
4        [0.25 0.31 0.0]^T         8.8 ± 5.4     4.3 ± 3.3      5.1




6 Plant Leaf Motion 

This section finally presents some flow fields found by observing living castor
oil leaves. Figure 5 shows four examples of the type of data and flow fields
encountered in this application. The folding of the outer lobes is clearly visible,
as is a fair bit of lateral motion of the leaf. The data sets considered here are
taken at night with a sampling interval of 5 minutes. Analysis is done using the
same parameters as in the previous section. In one of the examples in Fig. 5 two
overlapping leaves are observed; it is such cases that make a segmentation based
on the TLS algorithm very useful. If the leaves are actually touching each other it
is quite involved to separate them otherwise.

In collaboration with the botanical institute at the University of Heidelberg 
and the Agriculture and Agri-Food Canada research station in Harrow, Ontario 
we seek to establish the leaf's diurnal motion patterns. We also hope to examine
the growth rate of an undisturbed leaf with a previously impossible spatial and 
temporal resolution. Up to now related experiments required the leaf to be fixed 
in a plane uni- 

7 Conclusions 

An algorithm to compute dense 3D range flow fields from a sequence of depth 
maps has been presented. It is shown how the sparse information from a TLS 
based technique can be combined to yield dense full flow fields everywhere wit- 
hin a selected area. The segmentation into regions corresponding to different 
motions can easily be done based on the quality of the initial TLS estimation. 
The performance is quantitatively assessed on synthetic and real data and the 
algorithm is found to give excellent results. Finally it could be shown that the 
motion of a living castor oil leaf can be nicely captured. 

Future work includes the interpretation of the obtained flow fields from a 
botanical point of view. We also plan to test the method on depth data from 
structured lighting and stereo. 

Acknowledgements. Part of this work has been funded under the DFG research 
unit “Image Sequence Analysis to Investigate Dynamic Processes” (Ja395/6) 
and by the federal government of Canada via two NSERC grants.

Fig. 5. Example movements of castor oil plant leaves (panels a, b).



A Minimum Norm Solution 

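Let's assume we have λ_1 > ... > λ_p > λ_{p+1} ≈ ... ≈ λ_n ≈ 0; then any linear
combination of the eigenvectors e_i, i > p, is a solution to the TLS problem.
Following CHI (Theorem 3.8) we now describe a way to find the minimum norm solution.

First the possible solutions are expressed as linear combinations of the
relevant eigenvectors

$$ p = \sum_{i=p+1}^{n} g_i e_i = E_p g \quad \text{where} \quad E_p = [e_{p+1}, \ldots, e_n] = \begin{bmatrix} e_{1(p+1)} & \cdots & e_{1n} \\ \vdots & & \vdots \\ e_{n(p+1)} & \cdots & e_{nn} \end{bmatrix} . \tag{30} $$

The (squared) norm of p is then given by

$$ \|p\|^2 = g^T E_p^T E_p g = g^T g = \sum_i g_i^2 . \tag{31} $$

The additional constraint p_n = 1 can be expressed as

$$ p_n = \sum_{i=p+1}^{n} g_i e_{ni} = v_n^T E_p g = 1 , \tag{32} $$

where v_n = [0, ..., 0, 1]^T. Equations (31) and (32) can be combined using a
Lagrange multiplier

$$ F(g) = g^T g + \lambda \, (v_n^T E_p g) . \tag{33} $$

The minimum is found by setting the partial derivatives of F with respect to the
g_i to zero. Doing so yields

$$ 2 g_i + \lambda e_{ni} = 0 \;\rightarrow\; g_i = -\frac{\lambda}{2} e_{ni} \;\Leftrightarrow\; g = -\frac{\lambda}{2} E_p^T v_n . \tag{34} $$

Substitution into (32) gives

$$ -\frac{\lambda}{2} \sum_{i=p+1}^{n} e_{ni} e_{ni} = 1 \;\rightarrow\; \lambda = \frac{-2}{\sum_{i=p+1}^{n} e_{ni} e_{ni}} = \frac{-2}{v_n^T E_p E_p^T v_n} . \tag{35} $$

The minimum norm solution then equates to

$$ p = \frac{E_p E_p^T v_n}{v_n^T E_p E_p^T v_n} . \tag{36} $$

In components this equals

$$ p_k = \frac{\sum_{i=p+1}^{n} e_{ki} e_{ni}}{\sum_{i=p+1}^{n} e_{ni}^2} , \tag{37} $$

or as vector equation

$$ [p_1, \ldots, p_{n-1}]^T = \frac{\sum_{i=p+1}^{n} e_{ni} \, [e_{1i}, \ldots, e_{(n-1)i}]^T}{\sum_{i=p+1}^{n} e_{ni}^2} = \frac{-\sum_{i=1}^{p} e_{ni} \, [e_{1i}, \ldots, e_{(n-1)i}]^T}{1 - \sum_{i=1}^{p} e_{ni}^2} , \tag{38} $$

where we used E_p E_p^T = \mathbb{1} - \bar{E}_p \bar{E}_p^T, with \bar{E}_p = [e_1, ..., e_p], in the last equality.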
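
As a numerical illustration of Eq. (36), the following sketch computes the minimum norm solution from the eigenvectors belonging to the (near-)vanishing eigenvalues of a symmetric matrix. The function name, the matrix name `J` and the relative threshold `tol` are hypothetical choices made for this example; they are not taken from the text.

```python
import numpy as np

def minimum_norm_solution(J, tol=1e-8):
    """Minimum norm vector p in the near-null space of symmetric J with p_n = 1."""
    eigvals, eigvecs = np.linalg.eigh(J)       # eigenvalues in ascending order
    # Eigenvectors whose eigenvalues are negligible relative to the largest one.
    null_mask = eigvals <= tol * eigvals.max()
    E = eigvecs[:, null_mask]                  # plays the role of E_p
    if E.shape[1] == 0:
        raise ValueError("no vanishing eigenvalue found")
    last_row = E[-1, :]                        # entries e_ni, i.e. E_p^T v_n
    denom = last_row @ last_row                # v_n^T E_p E_p^T v_n
    if denom < 1e-15:
        raise ValueError("constraint p_n = 1 cannot be satisfied")
    return E @ last_row / denom                # Eq. (36)
```

For a single constraint vector a = [1, 2, -1, 0.5], with J = a a^T, the result is p = [-1/12, -1/6, 1/12, 1], which is the shortest vector orthogonal to a whose last component equals one.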



B Inversion of A 

To show that A is always regular we use the Sherman-Morrison-Woodbury
Lemma izq. It states that for a regular (n,n) matrix Q, two (n,m) matrices
R, T and a regular (m,m) matrix S the combination

$$ \tilde{Q} = Q + R S T^T \tag{39} $$

is regular if it can be shown that U := S^{-1} + T^T Q^{-1} R is regular. To apply this
to \tilde{Q} = A = α\mathbb{1} + ωP we rewrite A as

$$ A = \mathbb{1}_n + B_p \mathbb{1}_m B_p^T . \tag{40} $$

Here we dropped the constants α and ω without loss of generality. Thus we have
to examine

$$ U = \mathbb{1}_m + B_p^T B_p . \tag{41} $$

The off-diagonal elements of U are given by b_i^T b_j = cos(β_ij) < 1, where β_ij is
the angle between b_i and b_j. Thus U is diagonally dominant with all diagonal
elements strictly positive and hence a symmetric positive definite matrix EH-.
This implies U is regular. Thus we conclude that for α > 0 the matrix A can
always be inverted.
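
The lemma can be checked numerically with the standard Woodbury identity. The sketch below is only an illustration: the function name is hypothetical and the example matrix B in the usage note is arbitrary, not data from the paper.

```python
import numpy as np

def woodbury_inverse(Q_inv, R, S, T):
    """Inverse of Q + R S T^T from Q^{-1}, using U = S^{-1} + T^T Q^{-1} R."""
    U = np.linalg.inv(S) + T.T @ Q_inv @ R
    # Woodbury identity: the inverse exists whenever Q, S and U are regular.
    return Q_inv - Q_inv @ R @ np.linalg.inv(U) @ T.T @ Q_inv
```

With Q = 1_3, S = 1_2 and R = T = B, where B holds two unit-length columns, the product of the returned matrix with 1_3 + B B^T reproduces the identity up to rounding.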

In the direct regularisation case described in Sect. 14. l1 we encounter Ai = 
I3 -I- Thus we have to look U = \\ + d'^Ilad' which is simply a 

scalar. Hence Ai can always be inverted provided a > 0.

Abstract. Many computational models indicate ambiguities in the recovery of
plane orientation from optic flow. Here we questioned whether psychophysical 
responses agree with these models. We measured the perceived tilt of a plane 
rotating in depth with two-view stimuli for 9 human observers. Response 
accuracy was higher under wide-field perspective projection (60°) than in small 
field (8°). Also, it decreased when the tilt and frontal translation were 
orthogonal rather than parallel. This effect was stronger in small field than in 
large field. Different computational models focusing on the recovery of plane 
orientation from optic flow can account for our results when associated with a 
hypothesis of minimal translation in depth. However, the twofold ambiguity 
predicted by these models is usually not found. Rather, most responses show a 
shift of the reported tilts toward the spurious solution with concomitant 
increase in response variability. Such findings point to the need for further 
simulations of the computational models. 



1 Introduction 



Plane orientation can be defined by the tilt (τ) and slant (σ). We call N the vector
normal to a plane. Slant is the angle between N and the normal to the frontoparallel
plane (the line of sight), while tilt is the orientation of N as projected in the
frontoparallel plane (Fig. 1). Determining the tilt of a
planar surface is required for navigation, when climbing a slope for instance, or for 
actions like grasping flat objects. In these situations, motion parallax is a depth cue 
that reveals the 3D structure of the visual scene to biological or machine vision 
systems [19], [15]. 

The perception of tilt from optic flow has been addressed by few psychophysical 
studies. Domini and Caudek [6] found that observers estimate tilt more accurately 
than slant in multiple-view stimuli. Cornilleau-Peres et al. [4] defined the winding
angle W as the angle between the tilt and the component of the frontal translation
(Fig. 2). They found that the accuracy in tilt reports decreases as W increases, this
effect being particularly strong in small field (8° visual angle).

A more systematic exploration of this question is found in the theoretical domain, 
where several models of tilt computation have been proposed [10], [13], [16] and give 
a thorough account for the existence of multiple solutions in the problem of tilt 
computation from optic flow. However, these studies give little information on the
performance of the corresponding algorithms in terms of accuracy and robustness to 
noise. While many have developed error analyses of the structure from motion 
problem [1], [18], [5], [3], little is known on the accuracy of orientation estimates for a 
planar scene. In general, the recovery of the motion and structure parameters seems to 
have a maximal sensitivity to image noise when the 3D translation is parallel to N 
[18], [5]. Baratoff [2] is the only author addressing the sensitivity of tilt and slant 
estimates from binocular parallax. He finds that tilt is less sensitive than slant to 
variations of the viewing geometry, and that both variables are seriously affected by 
image noise. Contrary to slant estimates, tilt computation does not require a metric of 
the visual space, and the recent interest for an ordinal, rather than metrical, 
representation of depth [12], [8] warrants a deeper understanding of the properties of 
tilt perception. 

In this respect, the human performance may help in designing simulation tests,
since it points to two critical variables in tilt perception, namely the size of the field of
view (FOV) and the orientation of the plane relative to its 3D motion [4]. Because the
previous results were obtained with multiple-frame stimuli, which provide
complementary acceleration information [9], our first goal was to develop a
systematic exploration of human tilt perception with two-view stimuli. We evaluated 
the errors in tilt estimation, and also the position of the perceived tilt with respect to 
the stimulus tilt and the direction of frontal translation (as is well-known and shown 
in the appendix, these are the two possible solutions for tilt). Our second objective 
was to test the predictions of different computational models so as to propose new 
directions of research on tilt perception in both biological and machine vision. 



2 Preliminaries 

If the position of the eye is the origin of an XYZ coordinate system, and the Z axis
lies along the line of sight (Fig. 1), a plane is given by the equation:

$$ Z = Z_x X + Z_y Y + Z_0 \tag{1} $$

In what follows, we suppose that the plane moves with a rotation
Ω = (Ω_X, Ω_Y, Ω_Z) around the eye, and a translation T = (T_X, T_Y, T_Z). Thus,
T’ = (T_X, T_Y) is the frontal translation and τ_T’ = tan^{-1}(T_Y / T_X) is the angle
between T’ and the X axis in the frontoparallel plane. The normalized N and T are
n and t, respectively.

Under perspective projection, p = 1/Z can be written as

$$ p = p_x x + p_y y + p_0 \tag{2} $$

where x = X/Z, y = Y/Z, p_x = -Z_x p_0, p_y = -Z_y p_0, and p_0 = 1/Z_0.

Fig. 1. The tilt and slant of an object plane. N is the normal to the plane, τ is tilt, σ is slant.

Fig. 2. The winding angle of a moving plane. T’ is the frontal translation, orthogonal to the
optical axis. N is the normal to the plane. The winding angle (W) is the unsigned angle between
the tilt (τ) direction and T’.



The tilt τ and the slant σ can be expressed as:

$$ \tau = \tan^{-1} \frac{p_y}{p_x} \tag{3} $$

$$ \sigma = \tan^{-1} \frac{\sqrt{p_x^2 + p_y^2}}{p_0} \tag{4} $$

Under orthographic projection,

$$ \tau = \tan^{-1} \frac{Z_y}{Z_x} \tag{5} $$

$$ \sigma = \tan^{-1} \sqrt{Z_x^2 + Z_y^2} \tag{6} $$
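
The relations between the plane parameters and the tilt and slant angles can be sketched as follows. The function names are hypothetical, and the use of `atan2` to resolve the quadrant of the tilt (measured from the depth gradient) is an assumption of this example; the formulas in the text only fix the tilt up to 180°.

```python
import math

def tilt_slant_from_depth_gradient(z_x, z_y):
    """Tilt and slant in degrees from the depth gradient (Z_x, Z_y) of the plane."""
    tilt = math.degrees(math.atan2(z_y, z_x)) % 360.0
    slant = math.degrees(math.atan(math.hypot(z_x, z_y)))
    return tilt, slant

def tilt_slant_from_inverse_depth(p_x, p_y, p_0):
    """Same angles from the inverse-depth coefficients p = p_x x + p_y y + p_0."""
    # p_x = -Z_x p_0 and p_y = -Z_y p_0, so the depth gradient is recovered first.
    return tilt_slant_from_depth_gradient(-p_x / p_0, -p_y / p_0)
```

Both functions return the same pair of angles for the same plane, so the perspective and orthographic descriptions can be compared directly.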



3 Experiment and Results 



3.1 Method 






• Participants 

Nine observers aged between 21 and 28 served as naive subjects for this 
experiment. All of them had normal or corrected-to-normal vision. 

• Design 

We examined the effects of two variables on the judgments of plane orientation in
terms of tilt and slant: (1) the size of the visual stimulus (diameters 8° or 60° visual
angle) and (2) the winding angle (W), randomized between 0° and 90° (here W is
unsigned). The tilt τ was randomly chosen between 0° and 360°. The angle of the
rotation axis was such that the angle between the tilt and the frontal translation was
+W or -W, and thus ranged randomly between 0° and 360°. The slant of the plane
was 35° and the rotation amplitude between the two views was 3°. There were 8
sessions of 108 trials for each field size. The sessions in small field and large field
were performed alternately in random order.

• Apparatus 

The stimulus patterns were generated on a PC, and displayed either on a 19-in.
monitor for small visual field or on a glass-fabric screen using a Marquee Ultra
8500 projector for large visual field. The diameter of the stimuli was 27.5 cm
(8° visual angle) in small field or 2 m (60° visual angle) in large field. Both
large-field and small-field displays had a spatial resolution of 768 pixels for the
stimulus diameter, and we used anti-aliasing software to achieve subpixel accuracy,
each dot covering a 3x3 pixel area. The refresh rate was 85 Hz.

• Stimuli 

The viewing distance was 1.96 m in small field, and 1.73 m in large field. The
stimuli in the experiment were perspective projections of dotted planes, with dots
spread uniformly within a circular area of the display window (Fig. 3). Each plane
rotated about a frontoparallel axis. In this case the component of frontal translation T’
is orthogonal to the rotation axis. A probe was presented in the center of the screen
and could be adjusted by the subjects to indicate the perceived plane orientation,
using the computer mouse. A uniform dot density was achieved in the position of the
surface corresponding to the intermediate position between the 2 views. The stimuli
were generated with the appropriate perspective for visual angles of 8° and 60°. The
motion sequence was composed of two views corresponding to rotational angles of
-1.5° and 1.5°. The duration of the two views was 0.38 ± 0.015 s. The dot number for
each stimulus was 572 ± 17. Trial duration was determined by the subject and usually
ranged around 8 s. The luminance was adjusted to 0.23 cd/m².




Fig. 3. Reporting tilt and slant through probe adjustment. The probe is made of a 
needle and an ellipse. Subjects adjust the orientation of the needle and the small-width 
of the ellipse with the computer mouse to indicate the perceived tilt (direction of the 
needle) and slant (width of the ellipse). 

• Procedure 

The subjects were seated with the head maintained in a chinrest, and the experimental
room was dark. With an eye patch covering the non-dominant eye, each subject was
asked to fixate the center of the stimulus. Each stimulus was displayed continuously,
and after 3 seconds of presentation, the subjects could adjust the XY position of the
mouse to modify the orientation of the probe superimposed on the stimulus. Upon
completion of the adjustment, they clicked the mouse, and proceeded to the next
trial.

• Data Analysis 

We partitioned the winding angles in nine intervals: 0°-10°, 10°-20°, ... 80°-90°. 
The average number of trials for each subject in each W interval was 96 (standard 
deviation 7). 

We measured the ambiguity on the tilt sign (tilt reversal) by calculating the
percentage of trials where the unsigned tilt error ranged between 90° and 180°.
Having corrected the responses for this ambiguity, we then used the corrected
absolute tilt error as a measure of the performance, ranging between 0° and 90°.
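
The reversal count and the corrected error described above can be sketched as follows. The function names are hypothetical, and treating the correction as a 180° flip of the response is an assumption about how the sign ambiguity was removed.

```python
def unsigned_tilt_error(reported_deg, stimulus_deg):
    """Unsigned angular difference in degrees, in the range [0, 180]."""
    diff = (reported_deg - stimulus_deg + 180.0) % 360.0 - 180.0
    return abs(diff)

def classify_trial(reported_deg, stimulus_deg):
    """Return (is_reversal, corrected_error) for a single trial."""
    err = unsigned_tilt_error(reported_deg, stimulus_deg)
    is_reversal = err > 90.0
    # A reversed response is flipped by 180 degrees (assumed correction),
    # so the corrected error always lies between 0 and 90 degrees.
    corrected = 180.0 - err if is_reversal else err
    return is_reversal, corrected

def summarize(trials):
    """Percentage of reversals and mean corrected error over (reported, stimulus) pairs."""
    results = [classify_trial(r, s) for r, s in trials]
    reversal_pct = 100.0 * sum(rev for rev, _ in results) / len(results)
    mean_error = sum(err for _, err in results) / len(results)
    return reversal_pct, mean_error
```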

In order to assess the influence of T’ (the direction of frontal translation) on the
reported tilt, we imposed a polarity on our data and calculated an ‘asymmeterized’
distribution of the responses, where the angle between the tilt and T’ is always
positive. Initially, the frontal translation is at angle +W or -W from the tilt. In the
second case, we replace T’ and the reported tilt by their symmetrical values relative to
τ. Hence we obtain a new distribution, where the angle between τ and T’ is always
+W. As the distribution is not Gaussian, we used non-parametric tests, computed with
the Statistica software.

We used circular statistics to find the mean of the distributions of the reported tilt.*
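
The mirroring step and a circular mean can be sketched as below. Note that `circular_mean_deg` uses the standard resultant-vector mean, which is swapped in here for illustration and is not the iterative flipping procedure the authors describe; all function names are hypothetical.

```python
import math

def wrap_deg(angle):
    """Wrap an angle in degrees to the interval [-180, 180)."""
    return (angle + 180.0) % 360.0 - 180.0

def asymmetrize(reported_deg, stimulus_tilt_deg, translation_deg):
    """Reported tilt relative to the stimulus tilt, mirrored so that T' lies at +W."""
    offset_translation = wrap_deg(translation_deg - stimulus_tilt_deg)  # +W or -W
    offset_reported = wrap_deg(reported_deg - stimulus_tilt_deg)
    # If T' lies at -W, reflect the response about the stimulus tilt.
    return -offset_reported if offset_translation < 0 else offset_reported

def circular_mean_deg(angles_deg):
    """Direction of the mean resultant vector, in degrees."""
    s = sum(math.sin(math.radians(a)) for a in angles_deg)
    c = sum(math.cos(math.radians(a)) for a in angles_deg)
    return math.degrees(math.atan2(s, c))
```

After mirroring, a positive value always means that the response is shifted from the stimulus tilt toward the direction of frontal translation.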



3.2 Results 

• Effect of the field of view (FOV) on the reported tilt sign 

We find 41% of tilt reversals in small field and only 2.4% in large field. This 
confirms previous results showing that large-field perspective projection 
disambiguates the sign of the perceived tilt [4]. All subsequent results are corrected 
with respect to the sign of the tilt. 

• Effect of the FOV on the absolute tilt error 

The average absolute tilt error, as presented in Fig. 4, is always lower in large field
than in small field, especially for large values of W. This effect of the FOV (median
test, χ² = 52 to 297, p<0.001) is significant for every subject.

• Effect of the winding angle on the absolute tilt error and variability in reported tilt

Fig. 4 shows the average absolute tilt error with respect to W in each of the W
intervals. In small field, the average absolute tilt error increases dramatically as W
increases. This effect is smaller in large field. The Spearman correlation of the
absolute tilt error with W is significant for each subject in small field (overall value:
0.572) and for 8 of the 9 subjects in large field (overall value: 0.115).

Fig. 5 indicates that the variability in reported tilt (width of the distribution) 
increases rapidly with W in small field but less in large field. In small field, the order 
of magnitude of the variability in the tilt report corresponds roughly to W/2 (from 
50% to 72% of W for 10°<W<80°). 

• Effect of the FOV on the asymmeterized tilt distributions 

Figs. 5 and 6 show that the trend of the reported tilt toward T’ is stronger in small
field than in large field. The FOV has a significant effect (median test with χ² = 77 to
344, p<0.001) on the asymmeterized distributions for 8 of the 9 subjects.

• Effect of the winding angle on the asymmeterized tilt distributions

W has a significant effect on the asymmeterized distributions of the reported tilt
(median test χ² = 821.5, p<0.001, in small field, and χ² = 179.2, p<0.001, in large field).
The factor ‘subject’ also has a significant, although less prominent, effect (χ² = 138.1,
p<0.001, in small field, and χ² = 129.4, p<0.001, in large field).

The effect of W could be due to the fact that, during an oscillation, the tilt is 
constant in time if W=0, but varies more as W increases with a span reaching 3° 
when W=90°. This effect is small, however, and cannot account for the large standard 
deviations observed when W increases (typically above 16°). 

Fig. 5 shows the histograms of the asymmeterized tilt reports for each W interval.
Here, the origin of the abscissae is the bisector between the tilt and the frontal
translation. Hence, the angle -W/2 corresponds to the stimulus tilt, and W/2 to the
direction of frontal translation T’.

* Due to the periodicity of tilt, we iteratively flipped the reported tilt into a single period and
calculated the mean until we achieved the minimum variance.

In small field, we usually observe a one-peak distribution in Fig. 5A except for
W=80°-90°, where the distribution is flattened and tends to present two peaks at -W/2
and +W/2. For other W ranges, the decline in the peak height and the shift of the peak
toward W/2 increase with W (Fig. 5A). The means of the distribution are plotted in
Fig. 6 for all subjects. In small-field (Fig. 6A), the means of tilt responses are 
significantly shifted towards the frontal translation direction (Wilcoxon Matched Pair 
Test, z=17.02, n=7776, p<0.001). Hence the perceived tilt lies between the stimulus 
tilt and the translation direction T’. Overall, the distributions of the reported tilt are 
centered near the bisector of the stimulus tilt and T’ . 

In large field, the distribution of reported tilts presents one peak (Fig. 5B), which is 
significantly shifted toward T’ (z=55.167, n=7776, p<0.001). This shift is, on average,
equal to only a fraction of W (5% to 33% when W<50°), hence the dominant
direction is the stimulus tilt, rather than the frontal translation direction. Also, this 
effect is weaker than in small-field, and significant for each category of W range only 
when W <50°. When W is higher than 50°, the shift toward T’ is not significant and 
it can even be reversed for some subjects. Thus, the dominant reported tilt is the 
stimulus tilt, although it is shifted slightly but significantly toward T’ for W <50°. 

• Shape of the response distributions 

We found a considerable positive skewness for all the distributions except for W in 
the range 80-90° in small-field. Therefore, subjects’ responses cannot be considered 
as being spread symmetrically about an average direction. Rather, we find that the 
presence of the translation direction tends to distort the shape of the distribution. 

For W<60° in small-field and all W ranges in large-field, the shape of the
distributions in Fig. 5 was found to be significantly sharper than the normal
distribution, and well fitted by Laplace distributions. We did the Laplace fittings on
the distributions in a new coordinate system with the mean as the origin.

Hence the Laplace distributions give a first approximation of the response
distributions, but the skewness that we observe in general precludes its use
for a full modelling of the data.

Fig. 4. Average absolute tilt error in each W category (2 views). 1: W=0°-10°, 2: W=10°-20°,
3: W=20°-30°, 4: W=30°-40°, 5: W=40°-50°, 6: W=50°-60°, 7: W=60°-70°, 8: W=70°-80°,
9: W=80°-90°.

Fig. 5. Histograms of the reported tilt. The left dashed line is the position of the stimulus tilt,
and the right one is the position of the frontal translation. The continuous line is the fitted Laplace
distribution. Abscissae: reported tilts in degrees. Ordinates: the fraction of the number of trials.
A and B: results in small and large field respectively. Each box corresponds to a 10°-wide W
interval.

Fig. 6. Means of the reported tilt in each W interval for each subject in small field (A) and large
field (B). Abscissae: the W categories, 1: W=0°-10°, ... 9: W=80°-90°. Ordinates: reported tilts
in degrees.

We compared the responses to the sum of two Laplace distributions, centered on 
the stimulus tilt, and on the direction of frontal translation T’, respectively. We chose 
the width as equal to the width found for the range 0-10° of W, which yields the 
smallest dispersion of the responses. 
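
The two-peak comparison model can be sketched as a sum of two Laplace densities placed at -W/2 (stimulus tilt) and +W/2 (direction of T’) on the bisector-centered axis of Fig. 5. The equal weighting of the two components and the function names are assumptions of this example; the text only states that both components share the width found for the 0-10° range of W.

```python
import math

def laplace_pdf(x, mu, b):
    """Laplace density with location mu and scale b (both in degrees)."""
    return math.exp(-abs(x - mu) / b) / (2.0 * b)

def two_peak_model(x, w, b):
    """Equal-weight sum of Laplace densities centered at -W/2 and +W/2."""
    return 0.5 * (laplace_pdf(x, -w / 2.0, b) + laplace_pdf(x, w / 2.0, b))
```

For a winding angle that is large compared with the width b, this model has two clearly separated peaks, which is the prediction the small-field data fail to show for W<80°.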

In small-field this modelling predicts a distinct peak in the direction of T’ (at
abscissa +W/2 in Fig. 5), which is not observed in our results when W<80°. Hence the
modelling by the sum of 2 distributions would require parameterizing the variance of
each distribution. For W ranging between 80° and 90°, however, the tilt distribution in
Fig. 5 presents two peaks in the direction of the stimulus tilt and of T’.

These peaks are shallower than Laplace-type peaks, but present a good symmetry
about the bisector of the stimulus tilt and T’.

In large-field, for W>50°, the values of the distribution for +W/2 (the direction of
T’) are close to zero, which means that the 2-peak distribution model does not hold in
this case.

In conclusion, in all cases except in small-field for the W range 80-90°, the positive
skewness of the distributions indicates a role of the translation direction, and
precludes the modelling of the responses by a unique symmetric distribution.
However, within these cases, our large-field data does not support the existence of a
2-peak underlying distribution, with peaks centered on the stimulus tilt and the
direction of the frontal translation. As for the small-field, other variables such as the
increase of the variance for each of the separate distributions would have to be
modelled to account for our results.

• Verbal Reports 

All nine subjects found the task more difficult in small field than in large field. 
Eight of them reported a perception of curved surfaces, rather than planes, for large 
values of W, particularly in large field. 



4 Computational Interpretation 

This section gives a computational interpretation of the preceding results. We 
examine the optic flow equations for a plane moving in the 3D space, with application 
to the particular case of rotation in depth, which is the motion used in our 
experiments. 



4.1 Choice of Projection Model and Assumptions 

For a plane rotating in depth, the second-order optic flow is small compared to
the first-order optic flow. The ratio of the magnitude of the second-order flow to
the magnitude of the first-order flow is equal to x/tan(σ), where x is the angular
eccentricity. This ratio is 0.1 and 0.82 for our small field and large field stimuli
respectively. Therefore, the second-order flow is likely to play little role in small
field. Hence we distinguish here the affine and full-flow processing for perspective
projection.
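
The two quoted ratios can be checked with a short calculation. The sketch assumes that the eccentricity is half the stimulus diameter (4° and 30°), that the slant is the 35° used in the experiment, and that the ratio is evaluated as the tangent of the eccentricity divided by tan(σ); this last reading is an assumption, chosen because it reproduces both quoted values.

```python
import math

def second_to_first_order_ratio(eccentricity_deg, slant_deg):
    """Ratio of second-order to first-order flow magnitude (assumed form)."""
    return math.tan(math.radians(eccentricity_deg)) / math.tan(math.radians(slant_deg))

small_field = second_to_first_order_ratio(4.0, 35.0)    # 8 degree stimulus
large_field = second_to_first_order_ratio(30.0, 35.0)   # 60 degree stimulus
```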

Many authors have used the orthographic projection as an approximation to small- 
field perspective projection [11]. However, this approximation does not hold for 
translations in depth, which create no optic flow in orthographic projection, but do 
yield an image expansion or contraction in perspective projection, even in small-field. 

Therefore the advantage of the perspective affine scheme is that it is quantitatively 
valid in small field for our stimuli, yet makes no a priori hypothesis on the 3D 
movement used. Note that, the perspective affine approximation would not hold for a 
curved surface, which can induce large second-order flows even in small field [7]. 

• Perspective projection and full flow 

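The optic flow field is [13], [16]

$$ \begin{aligned} u &= a_1 + a_2 x + a_3 y + a_7 x^2 + a_8 xy \\ v &= a_4 + a_5 x + a_6 y + a_7 xy + a_8 y^2 \end{aligned} \tag{7} $$

where

$$ \begin{aligned} a_1 &= T_X p_0 + \Omega_Y , \\ a_2 &= T_X p_x - T_Z p_0 , \\ a_3 &= T_X p_y - \Omega_Z , \\ a_4 &= T_Y p_0 - \Omega_X , \\ a_5 &= T_Y p_x + \Omega_Z , \\ a_6 &= T_Y p_y - T_Z p_0 , \\ a_7 &= \Omega_Y - T_Z p_x , \\ a_8 &= -\Omega_X - p_y T_Z . \end{aligned} $$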


Solving these nonlinear equations leads to a twofold ambiguity, with an
interchangeable role of vectors n and t [14]. Since the Z components of these
vectors are the cosine of the slant (for n) and the translation in depth (for t),
respectively, it follows that

(1) because our stimuli have no translation in depth, the wrongly perceived
orientation should correspond to a slant of 90°, i.e. a surface normal to the
frontoparallel plane, which normally yields no optic flow;

(2) if the subject assumes that T_Z = 0, he should perceive the correct tilt (the same
conclusion is reached for the hypothesis Ω_Z = 0).

Therefore the theoretical analysis of the full second order flow indicates that under
large field, the perceived tilt should always be unambiguous if the subject uses one of
the above hypotheses, or if he rules out the case of the spurious orientation.

• Perspective projection and affine flow

Using the affine flow alone will yield an infinite number of solutions. The affine
flow is characterized by the six coefficients of the optic flow equations:
a_1, a_2, a_3, a_4, a_5, a_6. Denoting \tilde{T}_Z = T_Z p_0, and from a_2, a_3, a_5, a_6, we
obtain

$$ \frac{a_2 + a_6}{2} + \tilde{T}_Z = \frac{T_X p_x + T_Y p_y}{2} $$

$$ \frac{a_5 - a_3}{2} - \Omega_Z = \frac{T_Y p_x - T_X p_y}{2} $$

which can be simplified to

$$ (\tilde{T}_Z - C_T)^2 + (\Omega_Z - C_\Omega)^2 = R^2 \tag{8} $$

where

$$ C_T = -\frac{a_2 + a_6}{2} , \quad C_\Omega = \frac{a_5 - a_3}{2} \quad \text{and} \quad R^2 = \frac{(a_2 - a_6)^2 + (a_5 + a_3)^2}{4} . $$

Thus, the infinite number of solutions to the 3D problem is parameterized by the
position of the point (\tilde{T}_Z, Ω_Z) on the circle C given by equation (8).

In our experiment, we have \tilde{T}_Z = Ω_Z = 0, which means that the circle C of equation
(8) passes through the origin.

As demonstrated in the Appendices (I and II), under the a priori assumption
T_Z = 0, we obtain two solutions for the tilt if W ≠ 0 and one solution only if W = 0.
Alternatively, if the hypothesis is that Ω_Z = 0, we find two solutions for W ≠ 90° and
only one for W = 90°. Hence, the tendency of observers to report an erroneous tilt
direction might have an interpretation in terms of the a priori position of the couple
(Ω_Z, \tilde{T}_Z) on this circle C.
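
The circle of solutions can be computed from the four affine coefficients alone. In this sketch the function name and the returned tuple layout are hypothetical; the center and radius follow from the identity (T_X p_x + T_Y p_y)² + (T_Y p_x - T_X p_y)² = (T_X p_x - T_Y p_y)² + (T_Y p_x + T_X p_y)².

```python
import math

def solution_circle(a2, a3, a5, a6):
    """Center (c_t, c_w) and radius of the circle of admissible (T_Z p_0, Omega_Z)."""
    c_t = -(a2 + a6) / 2.0
    c_w = (a5 - a3) / 2.0
    radius = 0.5 * math.hypot(a2 - a6, a5 + a3)
    return c_t, c_w, radius
```

Any pair (T_Z p_0, Ω_Z) that generated the measured coefficients lies on this circle; when both are zero, as in the experiment, the distance from the center to the origin equals the radius.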

• Orthographic Projection 

Under orthographic projection, the optic flow is equal to the frontal translation of 
the 3D point: 



JJ — “^Oy^O Oy Z ^ Xi + (Oy Zy — O 2^Y 
V = Ty -0 xZq+(02 -OxZx)X -O xZyY 



( 9 ) 



We can subtract from the optic flow the velocity vector in the center of the image
(at X=Y=0). The resulting flow is as if

$$\Omega_x = \frac{T_y}{Z_0}, \qquad \Omega_y = -\frac{T_x}{Z_0}.$$

Substituting these values of $\Omega_x$ and $\Omega_y$ into (9) leads to a system equivalent to
the perspective affine scheme associated with the hypothesis $T_z = 0$. Hence, there are
generally two solutions for the tilt direction if W≠0°. These two merge into one if W=0°.
If we make the hypothesis that $\Omega_z = 0$, the alternative ‘spurious’ solution is
eliminated.

Table 1. Number of solutions for the computation of tilt from optic flow. The numbers in each
triplet indicate the number of solutions for τ when W=0°, 0°<W<90°, and W=90°, respectively.

Projection type          | No assumption | Hypothesis: T_z = 0 | Hypothesis: Ω_z = 0
Perspective, full flow   | (1, 2, 2)     | (1, 1, 1)           | (1, 1, 1)
Perspective, affine flow | (∞, ∞, ∞)     | (1, 2, 2)           | (2, 2, 1)
Orthographic             | (1, 2, 2)     | (1, 2, 2)           | (1, 1, 1)

• Summary

A summary of the number of solutions with respect to the projection type and a 
priori conditions is listed in Table 1. 



4.2 Comparison with Our Results 

In our results, we distinguish three possible directions for the reported tilt, namely
the stimulus tilt, the direction of frontal translation T’, and their bisector. The
distributions of reported tilt tend to be centered

- on the stimulus tilt in large-field,
- on the stimulus tilt for W ranging between 0° and 10° in small-field,
- on the bisector for W>10° in small-field.

We can interpret our results as the consequence of a twofold ambiguity when the
stimulus tilt and frontal translation are not collinear in small-field, and of a single tilt
percept in all other cases. In this sense, our results support the validity of the
hypothesis $T_z = 0$, because it yields a unique solution in the perspective full flow
model, and an ambiguity on the computed tilt for W>0° in the perspective affine
scheme (Table 1).

However, this model is too simple to explain our results in more detail. First,
even when the presence of the frontal translation has a strong effect (small-field, 
W>10°), we usually do not observe a clear 2-peak distribution of the reported tilt, 
except for the W range 80-90°. Rather, the variability of the responses increases 
strongly with W, and a general flattening of the distributions is observed. Also, if 
there exists a twofold ambiguity, the influence of the stimulus tilt solution is stronger 
than that of the T’ solution. Indeed, for 10°<W<80°, the distribution is asymmetric 
and shifted slightly from the bisector toward the stimulus tilt. Second, for 
intermediate values of W (lower than 50°) in large-field, the distribution center is 
close to the stimulus tilt, with a significant shift toward the frontal translation 
direction. 



5 Discussion 

In summary, we find that 

(1) The FOV has a critical influence on the tilt reports. The accuracy of tilt reports 
increases in large-field, and tilt report distributions differ strongly in small and large 
field. 

(2) Tilt reversals are observed in small field but not in large field. When corrected 
with respect to tilt reversals, the average absolute tilt errors are smaller in large field 
than in small field. 

(3) The winding angle W significantly affects the performance on tilt perception in 
small field, and to a lesser extent, in large field. The absolute tilt error increases 
rapidly in small field when W increases. 

(4) W has a significant effect on the asymmeterized distributions of the perceived 
tilt. The reported tilt is generally shifted toward the direction of frontal translation. In 
small field, when W ranges between 80° and 90° the distribution of tilts presents one 
peak at the stimulus tilt and one peak at the frontal translation direction. Such a
two-peak shape is not found for other W ranges, or in large-field.

(5) The results in large-field are well predicted by the second-order full flow
modelling with additional constraints. However, the increase of the variability of the
responses, and their shift toward the direction of frontal translation for
W<50°, still remain to be explained theoretically.

(6) In small-field, the perspective affine approach with the hypothesis $|T_z|$
minimum, and the orthographic modelling, account for the two-peak distribution
observed when W is higher than 80°. For lower values of W, an explicit two-peak
distribution is not found. Rather, the responses are centered at the bisector of the
stimulus tilt and direction of the frontal translation, and their variability increases.
However, the absence of a hidden twofold ambiguity for 10°<W<80° in small field
still remains to be demonstrated.

Our results fully confirm those obtained by Cornilleau-Peres et al. [4], despite the
difference in the number of views displayed (72 in their studies, 2 in ours). There is a
similar dependence of the accuracy of the tilt report on W. The agreement is also
quantitative, as the average absolute tilt errors are similar (Table 2). Our results are
also in good agreement with those obtained by Domini et al. [6], in spite of the
difference in the number of views (1-83 in their studies). The tilt reports found by these
authors show standard deviations that can be estimated at around 20° from their
figure. Our standard deviations are larger (31.5°) in similar conditions (small-field,
all W pooled). This could be due to the choice of the direction of the
rotation, which is random in our experiment, and fixed in theirs. In the latter case the
subjects may have been helped by knowing this direction in advance, thereby improving
their tilt perception across trials.



Table 2. Average absolute tilt error (deg) in our results and in those of Cornilleau-Peres et al.
(CP et al.). W1: W=0°-30°, W2: W=30°-60°, W3: W=60°-90°.

                   | Small field           | Large field
                   | W1   | W2   | W3    | W1   | W2   | W3
Our results, σ=35° | 13.8 | 23.7 | 40.41 | 10.7 | 14.7 | 14.53
CP et al., σ=30°   | 11   | 19.5 | 45    | 12   | 17   | 23
CP et al., σ=45°   | 13   | 19   | 34    | 9    | 12.5 | 16


The results of this paper have consequences for the experimental evaluation of 
plane recovery algorithms. To have an appropriately stringent test of an algorithm, 
one should assess it against known hard situations. Our results provide 
psychophysical hints of such problem conditions. Therefore, while it has been 
claimed that under small field of view, weak perspective algorithms should be used 
because of their robustness, a more complete comparison should also test against the 
algorithm’s behavior under different winding angles. Only by carrying out such
motivated and controlled experiments will one understand an algorithm’s limits of
applicability.

In conclusion, our results demonstrate a strong effect of the stimulus size, and a 
clear anisotropy of tilt perception related to the orientation of the plane with respect to 
its movement. There is a general tendency for the perceived tilt to shift toward T’. 
Our results point to the crucial need for extensive sensitivity analysis of the 
computation of orientation from motion. The models proposed so far express the 
computation in terms of the presence or absence of a twofold ambiguity, whereas the 
distributions of reported tilt usually present a shift toward the spurious solution, rather 
than an explicit split into the two ambiguous solutions. 
Appendix



I. Solutions under the Perspective Affine Scheme with $T_z = 0$



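We only have the first order terms. Using $a_2, a_3, a_5, a_6$, we have

$$a_2 = T_x P_x, \qquad a_3 + a_5 = P_y T_x + P_x T_y, \qquad a_6 = T_y P_y.$$

When $a_2 \neq 0$, we get two equations with two unknowns, $P_y/P_x$ and $T_y/T_x$:

$$\frac{a_6}{a_2} = \frac{P_y}{P_x} \cdot \frac{T_y}{T_x}, \qquad \frac{a_3 + a_5}{a_2} = \frac{P_y}{P_x} + \frac{T_y}{T_x}.$$
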

Due to the quadratic nature of the equations, we usually get two solutions for the
tilt. They can be shown to be the true tilt and an alternative solution corresponding to
T’, up to 180°. If the tilt is parallel to the frontal translation, i.e., the winding angle is
zero, the two tilt solutions merge into one.

As we can express T’ in terms of τ and W as τ+W, the alternative solution for the
tilt can be written as τ+W.
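
The two relations of Appendix I say that $P_y/P_x$ and $T_y/T_x$ are the two roots of a single quadratic, which is why the tilt and the direction of T’ cannot be told apart. The sketch below is only an illustration under the assumption that $(P_x, P_y)$ points along the tilt direction and $(T_x, T_y)$ along T’; the function name is hypothetical.

```python
import math


def tilt_candidates(a2, a3_plus_a5, a6):
    """Return the two candidate directions in degrees (modulo 180).

    The ratios u = P_y/P_x and v = T_y/T_x satisfy u + v = (a3 + a5)/a2 and
    u * v = a6/a2, so both are roots of z**2 - (u + v) * z + u * v = 0.
    """
    if a2 == 0.0:
        raise ValueError("a2 must be non-zero; see the special case treated next")
    s = a3_plus_a5 / a2
    p = a6 / a2
    # For exact data the discriminant equals (u - v)**2 >= 0; clamp rounding error.
    disc = max(s * s - 4.0 * p, 0.0)
    root = math.sqrt(disc)
    ratios = ((s - root) / 2.0, (s + root) / 2.0)
    return tuple(math.degrees(math.atan(r)) % 180.0 for r in ratios)
```

For a stimulus tilt of 30° and a frontal translation direction of 70° (W=40°), the function returns both 30° and 70°; when the two directions coincide, the two roots merge.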

If $a_2 = T_x = 0$, it is still the same case:

(1). If $a_3 + a_5 \neq 0$, which indicates that T’ is not parallel to the tilt direction, as
$(T_x, T_y) \cdot (P_y, P_x) \neq 0$, we still have two solutions:

$$\left\{ T' = 90^\circ, \ \frac{P_y}{P_x} = \frac{a_6}{a_3 + a_5} \right\} \quad \text{and} \quad \left\{ \tau = 90^\circ, \ \frac{T_y}{T_x} = \frac{a_6}{a_3 + a_5} \right\}.$$

(2). If $a_3 + a_5 = 0$, which indicates that T’ is parallel to the τ direction, so that the winding angle
is zero, there is a unique solution: τ = T’ = 90°.

In summary, when $T_x = 0$, we usually have two solutions for the tilt direction, except
when W=0°.
II. Solutions under Perspective Affine Scheme with $\Omega_z = 0$

Using $a_2, a_3, a_5, a_6$, we have

$$a_3 = T_x P_y, \qquad a_2 - a_6 = T_x P_x - T_y P_y, \qquad a_5 = T_y P_x.$$

Following the same method as in Appendix I, we will have one solution when
W=90° and two when W≠90°.



III. Solutions under Orthographic Scheme 

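We denote the coefficients of the optic flow under orthographic projection as:

$$a'_1 = T_x + \Omega_y Z_0, \quad a'_2 = \Omega_y Z_x, \quad a'_3 = \Omega_y Z_y - \Omega_z,$$
$$a'_4 = T_y - \Omega_x Z_0, \quad a'_5 = \Omega_z - \Omega_x Z_x, \quad a'_6 = -\Omega_x Z_y.$$

We can subtract from the optic flow the velocity vector in the center of the image
(X=Y=0). The resulting flow is such that

$$a'_1 = a'_4 = 0, \quad \text{i.e.} \quad \Omega_x = \frac{T_y}{Z_0}, \quad \Omega_y = -\frac{T_x}{Z_0}.$$

Substituting these values of $\Omega_x$ and $\Omega_y$ into $a'_2, a'_3, a'_5, a'_6$ and following the
same method as in Appendix I, we usually have two solutions except at W=0°, when
they merge into one.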
Abstract. This paper proposes a new method for effecting feature correspondence
between images. The method operates from coarse to fine and is superior to
previous methods in that it can solve the wide baseline stereo problem, even when
the image has been deformed or rotated. At the coarsest level a RANSAC-style
estimator is used to estimate the two view image constraint $\mathcal{R}$ which is then used to
guide matching. The two view relation is an augmented fundamental matrix, being
a fundamental matrix plus a homography consistent with that fundamental matrix.
This is akin to the plane plus parallax representation with the homography being
used to help guide matching and to mitigate the effects of image deformation.

In order to propagate the information from coarse to fine images, the distribution of
the parameters $\theta$ of $\mathcal{R}$ is encoded using a set of particles and an importance sampling
function. It is not known in general how to choose the importance sampling
function, but a new method “IMPSAC” is presented that automatically generates 
such a function. It is shown that the method is superior to previous single resolution 
RANSAC-style feature matchers. 

Keywords: Structure from motion, Stereoscopic vision. 



1 Introduction 



The goal of this work is to obtain accurate matches and image relations between cons- 
ecutive images, with the ultimate aim of recovering 3D structure and camera projection 
matrices from an uncalibrated image sequence (such as might be obtained from a hand- 
held camcorder) where the motion is unlikely to be smooth or known a priori. Once 
the matches and two view image relation have been recovered, they can be used for 
image compression, or as a basis for building 3D graphical models from an image sequence.
These are underpinned by the need to match tokens/features (usually
interest points) successfully through image sequences with a large number of frames. 
It transpires that the correspondence problem is one of the most difficult parts of struc- 
ture recovery, especially when these images are far apart (the wide baseline problem) 
or when they undergo large rotations (the image deformation problem). Small baseline
image matching technology has made large advances over the past decade,
but there has been comparatively little progress in wide baseline matching
technology. Furthermore, the small baseline methods do not work on every image pair. 
For example, feature based cross correlation methods may fail if (1) there are insufficient 
features in the image pair, (2) there is too much repeated structure for features to get 
a good match, or (3) there is an image deformation that causes the cross correlation to
fail. 

There has been some work on rectifying these problems. Pritchett and Zisserman
present a set of recipes for special cases, but no unified theory of how to solve the problem
in general. Cham and Cipolla [5] present a multi scale method for feature matching when
making mosaics. The work is valid only if there is no parallax, i.e. if the image motion 
is governed by a homography. Furthermore the formulation is flawed as it propagates 
parameters using the estimate at the coarser level as a prior for the estimate at the finer, 
but since the images at fine and coarse resolution are not independent, the prior and 
likelihood are not independent. This leads to an erroneous posterior, which is then used 
(in their method) as a prior for the next level, compounding the error. 

The method presented here solves the image deformation and wide baseline matching 
problems. It also requires no camera calibration. A coarse to fine approach is adopted 
in which information about the epipolar geometry is passed from the coarser levels to 
the finer. Ideally, the information to be transmitted would be the posterior distribution of 
the parameters at the coarser level. Encoding this posterior distribution and its relation 
to the finer level is an intricate task, not least because the normalization constant of
the distribution is unknown. Three powerful statistical methods are enlisted to create a
solution: (1) the representation of the distributions as a set of particles, (2) the use of importance
sampling to generate unbiased draws from the posterior distribution, (3) RANSAC to
generate the importance sampling function. In this way the posterior distribution at
the coarse level is used as an importance sampling function to draw samples from the
posterior distribution at the finer level. As a result, the epipolar geometry is estimated
by using features at many different scales, solving the problem of having to select this 
scale manually. 

A fundamental component of several existing algorithms is the use of epipolar geome- 
try to simplify the search for correspondences between view pairs, particularly because 
epipolar geometry and matches consistent with this geometry may be computed simul- 
taneously, using only features in each view. Two images of a rigid object are related by a 
fundamental matrix, or in special cases just by a homography. The types of two view
relations that might arise are described in Section 2 and the likelihood of the matches given
these relations in Section 2.1. Existing geometry based matching methods are reviewed
in Section 3; they comprise two stages: (a) estimate best cross correlation matches,
(b) estimate epipolar geometry using a robust estimator. However, this approach breaks
down for the image deformation and wide baseline cases. In Section 4 the coarse to fine
algorithm is outlined, and the wide baseline problem overcome, but cross correlation
still fails if there is image deformation. This is because matches are initially scored by a
combination of their cross correlation score and their agreement with epipolar geometry.
However, in order to calculate the cross correlation the deformation of each image patch
must be known. Thus an image deformation homography is estimated in addition to the
epipolar geometry, leading to a plane plus parallax representation. Local patches may
be warped by the image deformation homography to establish cross correlation scores.
This combined set of parameters is referred to as the augmented fundamental matrix
and is described in Section 5. The results section then demonstrates the algorithm
on the wide baseline and the image deformation problems.
Notation The image of a 3D scene point X is $\mathbf{x}^1$ in the first view and $\mathbf{x}^2$ in the second,
where $\mathbf{x}^1$ and $\mathbf{x}^2$ are homogeneous three vectors, $\mathbf{x} = (x, y, 1)^{\top}$. The correspondence
$\mathbf{x}^1 \leftrightarrow \mathbf{x}^2$ will also be denoted as $\mathbf{x}^{1,2}$. Throughout, underlining a symbol $\underline{x}$ indicates the
perfect or noise-free quantity, distinguishing it from $x = \underline{x} + \Delta x$, which is the measured
value corrupted by noise.



2 The Two View Relations 

Within this section the possible relations $\mathcal{R}$ on the motion of points between two views
are summarized. Four examples of $\mathcal{R}$ are considered: (a) the fundamental matrix,
(b) the affine fundamental matrix, (c) the planar projective transformation (a
homography), and (d) the affinity. All these two view relations are estimable from image
correspondences alone.

The epipolar constraint is represented by the fundamental matrix. This relation
applies for general motion and structure with uncalibrated cameras; consider the
movement of a set of point image projections from an object which undergoes a rotation
and non-zero translation between views. After the motion, the set of homogeneous image
points $\{\mathbf{x}_i\}$, i = 1, ... n, as viewed in the first image is transformed to the set $\{\mathbf{x}'_i\}$ in
the second image, with the positions related by

$$\mathbf{x}_i'^{\top} F \mathbf{x}_i = 0 \qquad (1)$$

where $\mathbf{x} = (x, y, 1)^{\top}$ is a homogeneous image coordinate and F is the fundamental
matrix. The affine fundamental matrix $F_A$ is the linear version of F. The affine camera
is applicable when the data is viewed under orthographic conditions and gives rise to a
fundamental matrix with zeroes in the upper 2 by 2 submatrix*, and it is studied in detail
by Shapiro.

In the case where all the observed points lie on a plane, or the camera rotates about 
its optic axis and does not translate, then all the correspondences lie on a homography: 

x' = Hx . (2) 

The affinity $H_A$ is a special case of the homography with zeros for the first two elements
of the bottom row. Again it is valid under uncalibrated orthographic conditions.
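
To make relations (1) and (2) concrete, here is a minimal pure-Python sketch that evaluates the epipolar residual $\mathbf{x}'^{\top} F \mathbf{x}$ and transfers a point through a homography. The helper names and the example matrices are illustrative assumptions, not values taken from the paper.

```python
def mat_vec(m, v):
    """Multiply a 3x3 matrix (list of rows) by a 3-vector."""
    return [sum(m[r][c] * v[c] for c in range(3)) for r in range(3)]


def epipolar_residual(f, x1, x2):
    """Algebraic residual x2^T F x1; zero for a noise-free correspondence."""
    fx = mat_vec(f, x1)
    return sum(x2[k] * fx[k] for k in range(3))


def apply_homography(h, x1):
    """Map a homogeneous point through H and rescale so the last entry is 1."""
    hx = mat_vec(h, x1)
    return [hx[0] / hx[2], hx[1] / hx[2], 1.0]


# Example: a camera translating along the image x axis gives F = [t]_x with
# t = (1, 0, 0), so corresponding points must lie on the same image row.
F_SIDEWAYS = [[0.0, 0.0, 0.0], [0.0, 0.0, -1.0], [0.0, 1.0, 0.0]]
# Example homography: a pure image translation by (4, -2).
H_SHIFT = [[1.0, 0.0, 4.0], [0.0, 1.0, -2.0], [0.0, 0.0, 1.0]]
```

A match that changes row under `F_SIDEWAYS` yields a non-zero residual, which is how a putative correspondence is flagged as inconsistent with the relation.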



2.1 Likelihood of a Match Given a Relation 

In this section, the maximum likelihood formulation is given for computing any of the 
multiple view relations, given a set of matches. Later this formalism will be extended to 
include the case when the matches themselves are unknown and must be estimated. In 
the following we make the assumption that the noise in the two images is Gaussian on 

* Actually $F_A$ occurs in the non-orthographic case when the optical planes of the two cameras
coincide. Affine reconstruction in this case gives projectively correct results.
each image coordinate with zero mean and uniform standard deviation $\sigma$. Thus, given a
true correspondence, the probability density function of the noise perturbed data is

$$p(\mathbf{x}^{1,2} \mid \mathcal{R}) = \prod_{i=1}^{n} \left( \frac{1}{\sqrt{2\pi}\,\sigma} \right)^{4} \exp\left( -\sum_{j=1,2} \frac{(\underline{x}_i^j - x_i^j)^2 + (\underline{y}_i^j - y_i^j)^2}{2\sigma^2} \right) \qquad (3)$$

where n is the number of correspondences and $\mathcal{R}$ is the appropriate 2 view relation, e.g.
the fundamental matrix or projectivity.

The above derivation assumes that the errors are Gaussian. Often, however, features
are mismatched and the error on the match is not Gaussian. Thus the error can be modelled
as a mixture of a Gaussian and a uniform distribution:

$$p(e) = \gamma \left( \frac{1}{\sqrt{2\pi}\,\sigma} \right)^{4} \exp\left( -\frac{e^2}{2\sigma^2} \right) + (1 - \gamma) \frac{1}{v}$$

where $e^2$ is the sum of squared coordinate errors of a match, $\gamma$ is the mixing parameter,
v is just a constant, and $\sigma$ is the standard deviation of
the error on each coordinate. To correctly determine $\gamma$ and v entails some knowledge of
the outlier distribution; here it is assumed that the outlier distribution is uniform, with
$-\frac{v}{2} \ldots +\frac{v}{2}$ being the pixel range within which outliers are expected to fall (for feature
matching this is dictated by the size of the search window for matches). Therefore the
error minimized is the negative log likelihood:



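$$-L = -\sum_{i=1}^{n} \log\left( \gamma \left( \frac{1}{\sqrt{2\pi}\,\sigma} \right)^{4} \exp\left( -\sum_{j=1,2} \frac{(\underline{x}_i^j - x_i^j)^2 + (\underline{y}_i^j - y_i^j)^2}{2\sigma^2} \right) + (1 - \gamma) \frac{1}{v} \right).$$
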



Given a suitable initial estimate there are several ways to estimate the parameters of
the mixture model, most prominent being the EM algorithm, but gradient descent
methods could also be used. Because of the presence of outliers in the data the standard
method of least squares estimation is often not suitable as an initial estimate, and it is
better to use a robust estimate such as RANSAC which is described in the next section.
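
The robust cost can be sketched in a few lines. The version below is a hedged illustration rather than the authors' implementation: it takes one squared error per match (the sum of squared coordinate residuals), and the Gaussian normalisation over `dims = 4` coordinates is an assumption made so that the inlier term is a proper density; the function and parameter names are hypothetical.

```python
import math


def negative_log_likelihood(squared_errors, sigma, gamma, v, dims=4):
    """Negative log likelihood of a Gaussian-plus-uniform mixture.

    squared_errors: one summed squared residual per match.
    sigma: standard deviation of the noise on each image coordinate.
    gamma: mixing parameter (expected inlier fraction); use gamma < 1.
    v: width of the uniform outlier range (e.g. the search window size).
    """
    gauss_norm = (1.0 / (math.sqrt(2.0 * math.pi) * sigma)) ** dims
    total = 0.0
    for e2 in squared_errors:
        inlier = gamma * gauss_norm * math.exp(-e2 / (2.0 * sigma ** 2))
        outlier = (1.0 - gamma) / v
        total -= math.log(inlier + outlier)
    return total
```

Because the uniform term bounds each summand, a gross mismatch contributes at most $-\log((1-\gamma)/v)$ to the cost, which is what keeps outliers from dominating the estimate.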



3 Random Sampling Guided Matching 

Within this section the state of the art in feature matching is described. This computation
requires initial matching of points (e.g. corners detected to sub-pixel accuracy by the
Harris corner detector) between two images; the aim is then to compute the relation
from these image correspondences. Given a corner at position (x, y) in the first image, the 
search for a match considers all corners within a region centred on (x, y) in the second 
image with a threshold on maximum disparity. The strength of candidate matches is 
measured by sum of squared differences in intensity. At this stage, the threshold for 
match acceptance is deliberately conservative in order to minimise incorrect matches. 
Nevertheless, many mismatches will occur because the matching process is based only 
on proximity and similarity. These mismatches (called outliers) are sufficient to render 
standard least squares estimators useless. Consequently robust methods must be adopted, 
which can provide a good estimate of the solution even if some of the data are outliers. 

There are potentially a significant number of mismatches amongst the initial matches. 
Since correct matches will obey the epipolar geometry, the aim is to obtain a set of 
“inliers” consistent with the epipolar geometry using a robust technique. In this case
“outliers” are putative matches which are inconsistent with the epipolar geometry. Robust
estimators based on random sampling (such as MLESAC, LMS or RANSAC) have proven the
most successful. These algorithms are well known and briefly summarized
in Table 1.



Table 1. A brief summary of all the stages of random sampling guided matching 



1. Detect corner features using the Harris corner detector.

2. Putative matching of corners over the two images using proximity and cross correlation
to get the best set of matches.

3. Repeat for a fixed number of samples or until “jump out” occurs:

a) Select a random sample without replacement of the minimum number of correspondences
required to estimate the relation $\mathcal{R}$.

b) Estimate the unique image relation $\mathcal{R}$ consistent with this minimal set.

c) Calculate the error $-L$ for all matches (MLESAC), or the median of residuals (LMS),
or the number of inliers (RANSAC).

4. Select the best solution over all the samples, i.e. that which minimizes $-L$ (MLESAC),
or that which minimizes the median error (LMS), or that which maximizes the number
of inliers (RANSAC).

5. Minimize robust cost function over all correspondences using gradient descent.
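
Steps 3 and 4 of Table 1 can be sketched with a deliberately simple stand-in for the relation: a pure 2D image translation, whose minimal sample is a single match. This is an assumption made to keep the example short; the paper estimates fundamental matrices and homographies, and all names below are hypothetical. Scoring follows the RANSAC variant (number of inliers).

```python
import math
import random


def residual(model, match):
    """Distance between the predicted and the measured position in image 2."""
    (x1, y1), (x2, y2) = match
    return math.hypot(x1 + model[0] - x2, y1 + model[1] - y2)


def ransac_translation(matches, threshold, n_samples, rng):
    """Return (best translation, its inliers) over n_samples random minimal sets."""
    best_model, best_inliers = None, []
    for _ in range(n_samples):
        # 3a) minimal sample: one correspondence determines a translation
        (x1, y1), (x2, y2) = rng.sample(matches, k=1)[0]
        # 3b) the relation consistent with this minimal set
        model = (x2 - x1, y2 - y1)
        # 3c) score the hypothesis by its number of inliers
        inliers = [m for m in matches if residual(model, m) <= threshold]
        # 4) keep the best hypothesis seen so far
        if len(inliers) > len(best_inliers):
            best_model, best_inliers = model, inliers
    return best_model, best_inliers
```

Swapping the inlier count for the negative log likelihood of Section 2.1 would turn the same loop into the MLESAC variant.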



3.1 Problems with Conventional Matching 

There are two types of failure mode for the class of matching algorithms in Table 1. The
first is the wide baseline case, see Figure 1, which shows two images taken at the same
time instant†, where the disparity is 160 pixels. In the conventional algorithm, described
above, a search window must be set for putative matches. If this search window is too
large (which it must be in this case to guarantee that the correct match lies within it),
then there is a combinatorial explosion of putative matches. This leads to a catastrophic
failure of correlation matching as there are too many potential false matches for each
corner. The second failure mode is caused when the image is rotated (see Figure 2).
In this case, standard correlation matching cannot be expected to succeed, because the
correlation score is not rotationally invariant. Using a rotationally invariant correlation
score does not correct this problem; instead it reduces the discriminating power of the
score, increasing the number of mismatches even when the second image is not rotated.
The answer to both these problems, presented here, is to adopt a coarse to fine strategy.
The coarse to fine strategy has been used successfully for small baseline homography
matching, but neglected for feature matching.

† Kindly provided by Dayton Taylor
Fig. 1. Wide Baseline Failure of MLESAC/LMS/RANSAC: 50 matches from the first and last
images of the Samsung sequence. The images were captured at the same time instant and
are two of 50 taken from a 50 camera stereo rig. The features are shown in each image (circles)
together with the line joining them to their correspondence in the other image, and are matched 
with an affine fundamental matrix. Although several of the features with small disparities have 
been correctly matched, features with large disparities are incorrectly matched. This is because, 
as the disparity increases, so does the number of potential mismatches. 




Fig. 2. Catastrophic Failure of MLESAC/LMS/RANSAC Due To Rotation: the second image 
in the Zhang sequence has been rotated by 90 degrees, in addition there is a slight change of 
pose of the head. The image correlation used is not invariant to rotation, so there are too many 
mismatches for MLESAC to converge. Rotation-invariant correlation is not a solution to this
problem, because it is less discriminating and thus results in too many mismatches even when the 
second image is not rotated. 
4 Coarse to Fine

In the coarse to fine strategy, an image pyramid is formed by subsampling the image
repeatedly by a factor of 2. At the coarsest level of this pyramid (level l = 0), the
distribution of the parameters $\theta$ of the relation $\mathcal{R}$ given the data $D_l$ is $p(\theta \mid D_l)$. The
information contained in this posterior distribution should be propagated down to the
finer levels. One way to propagate information from one level to the next is to simply 
propagate down the mode of this distribution. However, at the coarsest levels this dis- 
tribution is not expected to have a strong peak and often propagation of the mode does 
not convey sufficient information. Too soon a commitment to a single hypothesis may 
cause the algorithm to converge to the wrong solution. Rather, it is desirable to pass as 
much of the distribution as possible from one level to the next. 

The coarse to fine strategy is beneficial for a number of reasons. It furnishes a 
solution to the wide baseline problem because the search window, and thus the number 
of potential false matches per corner, is reduced at the coarser levels. Furthermore, 
at the coarser level, it is less computationally intensive to estimate the global image 
deformation (e.g. cyclorotation), by testing different hypotheses for the deformation of 
the cross correlation between image patches. 
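
The effect of the pyramid on the search window can be illustrated with a short sketch. It assumes plain subsampling with no pre-smoothing (the text does not say whether a smoothing filter is applied) and represents an image as a list of rows; the function names are hypothetical. Level 0 is the coarsest level, as in the text.

```python
def subsample(image):
    """Keep every second pixel in each direction (factor-of-2 subsampling)."""
    return [row[::2] for row in image[::2]]


def build_pyramid(image, n_levels):
    """Return the levels ordered from coarsest (index 0) to finest."""
    levels = [image]
    for _ in range(n_levels - 1):
        levels.append(subsample(levels[-1]))
    levels.reverse()
    return levels


def disparity_at_level(disparity_finest, level, n_levels):
    """Disparity in pixels at a given level, with level 0 the coarsest."""
    return disparity_finest / 2 ** (n_levels - 1 - level)
```

With four levels, the 160 pixel disparity of the wide baseline example shrinks to 20 pixels at the coarsest level, so a much smaller search window suffices there.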

Two problems arise with this. First, the parametric form of the distribution is not
known. Second, the normalizing factor of the distribution is not known. The first problem
is overcome by representing the distribution by a set of particles $\{\theta_1 \ldots \theta_m\}$ with
weights $\{w_1 \ldots w_m\}$. This sort of representation has been used with a good deal of
success in the tracking literature. Ideally the set of particles would be drawn from
the posterior distribution. One way to achieve this is via importance sampling, which is
defined next.



4.1 Importance Sampling 

Importance sampling is a key step in drawing approximate samples from complicated
high dimensional posterior distributions for which the normalization factor is unknown.
Suppose it is of interest to draw samples from such a distribution $q(\theta)$, and there exists
a normalized positive density function (the importance sampling function) $g(\theta)$ from
which it is possible to draw samples. The algorithm proceeds as follows:



1. Generate a set of M draws $S^t = \{\theta_1, \ldots, \theta_M\}$ from $g(\theta)$.

2. Evaluate $q(\theta)$ for each element of $S^t$.

3. Calculate importance weights $w_i = q(\theta_i) / g(\theta_i)$ for each element of $S^t$.

4. Sample a new draw $S^{t+1}$ from $S^t$ where the probability of taking a new $\theta_i$ is
proportional to its weight $w_i$.



Iterating this procedure from step 2 is called sampling importance resampling (SIR).
This process, in the limit, produces a fair sample from the distribution $q(\theta)$ [9]. The rate
of convergence is determined by the suitability of the importance function $g(\theta)$. The
worst possible scenario occurs when the importance ratios are small with high probability 
and large with low probability. There is no general purpose method for choosing a good 
importance sampling function, but in the next section it will be explained how RANSAC 
can be used to construct one. 
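
A single pass through steps 1 to 4 can be sketched as follows for a scalar parameter. This is a minimal illustration, assuming a one-dimensional $\theta$ and user-supplied callables for the unnormalized target `q` and the importance function (`g_sample`, `g_density`); none of these names come from the paper.

```python
import math
import random


def gaussian_pdf(x, mean, sd):
    """Density of a normal distribution with the given mean and standard deviation."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))


def sir_pass(q, g_sample, g_density, m, rng):
    """One sampling importance resampling pass with m particles."""
    # 1) draw m particles from the importance function g
    draws = [g_sample(rng) for _ in range(m)]
    # 2-3) evaluate the unnormalized target q and form the weights q/g
    weights = [q(theta) / g_density(theta) for theta in draws]
    # 4) resample with probability proportional to the weights
    return rng.choices(draws, weights=weights, k=m)
```

Only ratios of weights matter in step 4, which is why the unknown normalization constant of `q` never needs to be computed.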
4.2 Using RANSAC to Generate the Importance Sampling Function: IMPSAC

The success of RANSAC-style methods proves that at least some of the generated sam- 
ples lie in areas of high posterior probability. It would be nice to be able to harness the 
RANSAC mechanism in order to generate a good importance sampling function with 
which to propagate information from coarse to fine levels. There are several ways in 
which this can be done. The method we favour is to model the importance function
$g(\theta)$ as a mixture of Gaussians, each centred at a RANSAC sample, with the mixing
parameters being in proportion to the posterior likelihood of each sample: $p(\theta \mid D)$. This
presents a new method for propagating probabilities: generate a density function $g(\theta)$
via RANSAC and use this as an importance sampling function to draw samples from 
the posterior. This method is dubbed “IMPSAC”. 
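
The construction of the importance function can be sketched for a scalar parameter as below. The sketch assumes a one-dimensional $\theta$ and a common, user-chosen Gaussian width `bandwidth`; the text does not specify how the component widths are set, so that parameter and the function names are assumptions.

```python
import math
import random


def make_importance_function(centres, likelihoods, bandwidth):
    """Mixture of Gaussians centred on the samples, weighted by their likelihoods.

    Returns (density, sample, mix) where mix holds the normalized mixing weights.
    """
    total = sum(likelihoods)
    mix = [w / total for w in likelihoods]

    def density(theta):
        return sum(
            w * math.exp(-0.5 * ((theta - c) / bandwidth) ** 2)
            / (bandwidth * math.sqrt(2.0 * math.pi))
            for w, c in zip(mix, centres)
        )

    def sample(rng):
        # pick a component in proportion to its weight, then perturb its centre
        centre = rng.choices(centres, weights=mix, k=1)[0]
        return rng.gauss(centre, bandwidth)

    return density, sample, mix
```

The returned `density` and `sample` callables have the form expected of $g(\theta)$ in the importance sampling procedure of Section 4.1, so hypotheses scored at a coarse level can seed the draws at the next finer level.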

Speed Up 1. Using all the particles to generate the mixture of Gaussians can be slow. 
Generally if the distribution is to be represented by L particles then a particle can be 
excluded from the computation if it contains less than 1/L of the mass of the density 
function. 

Speed Up 2. Often the artifice of constructing the mixture of Gaussians can be 
computationally onerous. A simpler device can be obtained under the assumption that 
the initial set of particles generated by the random sampling of minimal match sets is 
uniform. Although this assumption is not realistic in theory, unless we are interested 
in calculating integrals or exact expectations under the distribution, it is safe to make 
in practice (when all we are interested in is finding the mode of the distribution). One
case when the exact posterior would be of interest would be if one was evaluating the
evidence to effect model selection (e.g. choosing whether F or H best modelled the
data); this is the subject of a forthcoming paper.



5 The Augmented Fundamental Matrix 

In [27] it was shown that using H to guide matches throughout the sequence leads to
fewer matches being extracted in the part of the sequence undergoing a general motion, 
as might be expected since the model underfits this part. However, when a loose threshold 
of 3 pixels was used (as opposed to a threshold of 1.25 pixels which is the two sigma 
window arising from interest point measurement noise) the homography is able to carry 
correct matches even when the planar assumption is broken. The explanation lies in 
the “plane plus parallax” model of image motion: the estimated homography often
behaves as if induced by a ‘scene average’ plane, or indeed is induced by a dominant 
scene plane; the homography map removes the effects of camera rotation and change 
in internal parameters, and is an exact map for points on the plane. The only residual 
image motion (which is parallax relative to this homography) arises from the scene relief 
relative to the plane. Often this parallax is less than the loose displacement threshold, 
so that all correspondences may still be obtained. Thus the homography provides strong 
disambiguation for matching and the parallax effects do not exceed the loose threshold. 

This suggests a new method for matching, in which one (or more) homographies 
and a fundamental matrix are estimated for the data. The homographies estimated at the 
coarser level are used to guide the search windows in order to detect matches for the 
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features at the finer level. They can also be used to guide the cross correlation matching 
at the finer level in that the patches that are correlated can be corrected by transformation 
under the homography. This representation is referred to as the augmented fundamental 
matrix $F^{+}$ or affine fundamental matrix $F_A^{+}$. For the examples presented in this paper, one 
homography is sufficient to guide matching. This leads to a 10 parameter estimation 
problem for $F^{+}$ (8 for the homography and 2 for the epipole; alternatively, 7 for the 
fundamental matrix and 3 for the plane of the homography), and 7 for $F_A^{+}$ (6 for the 
affinity and 1 for the epipole, alternatively 4 for the affine fundamental matrix and 
3 for the plane). Future work will consider the use of several planes to augment the 
fundamental matrix, but for many image sequences one seems to be sufficient to get 
good matches. 

In order to estimate the augmented relation, the likelihood for a match given this 
relation (Section 5.1) is decomposed into two parts: the first is the usual likelihood of the 
fundamental matrix 0, the second is the likelihood of the parallax in the image given 
the homography. This is assumed to be Gaussian with large variance. This has the effect 
in general that if two equally good matches happen to lie along an epipolar line the one 
closer to the base plane represented by the homography is favoured. 

5.1 Augmented Likelihood Formulation 

Previously the optimisation was done on only the “best” set of matches found under cross 
correlation. If the image deformation is unknown, this is no longer acceptable and the 
likelihoods must be extended to incorporate a term for the probability of the correlation 
conditioned on a given match and a given homography. Given the set of images (the 
data) $D_l$ at level $l$ of the image pyramid, both the parameters of the relation $\theta$ and the 
set of matches $\delta_i$, $i = 1 \ldots n$, need to be estimated. Here the $i$th match is encoded by $\delta_i$, 
which is the disparity of the $i$th feature of the first image. The set of disparities of all the 
features is $\Delta$. The laws of probability give: 

$$p(\theta, \Delta \mid D_l) \propto p(D_l \mid \theta, \Delta)\, p(\theta, \Delta) = p(D_l \mid \theta, \Delta)\, p(\Delta \mid \theta)\, p(\theta). \quad (6)$$

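Under the assumption that the errors in each match are independent, and that the 
distributions of the matches are independent: 

$$p(\theta, \Delta \mid D_l) = \prod_i p(\theta, \delta_i \mid D_l) \propto \prod_i p(D_l \mid \theta, \delta_i)\, p(\delta_i \mid \theta)\, p(\theta). \quad (7)$$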

This is the criterion to be optimised. However, only the augmented relation $\theta$ is propagated 
from the coarser level, and the matches are encoded by the homography part of $\theta$ 
and the disparity assigned to the parallax. 

The probability of $\theta$ can be calculated by integrating out the disparity parameters. 
Note the following identity: $\int p(X, Y \mid I)\, dY = p(X \mid I)$. Then 

$$p(\theta \mid D_l) \propto \int p(D_l \mid \theta, \delta_1)\, p(\delta_1 \mid \theta)\, p(\theta)\, d\delta_1 \times \cdots \times \int p(D_l \mid \theta, \delta_n)\, p(\delta_n \mid \theta)\, p(\theta)\, d\delta_n. \quad (8)$$

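Since $\delta_i$ may take only a finite number of values, corresponding to the features 
$j = 1 \ldots m$ of the second image (see below for the case of occlusion), 

$$p(\theta \mid D_l) \propto \prod_i \sum_j p(D_l \mid \theta, \delta_i = j)\, p(\delta_i = j \mid \theta)\, p(\theta). \quad (9)$$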
Each term in this expression is the product of three elements. First, $p(D_l \mid \theta, \delta_i = j)$ 
is the likelihood of the image (patches) given the augmented fundamental matrix 
and the match $\delta_i = j$. This is evaluated from the cross correlation score after warping 
under the homography part of $\theta$, under the assumption that the image intensities have 
Gaussian error with mean zero and standard deviation $\sigma_\epsilon$. The second term, $p(\delta_i = j \mid \theta)$, is the 
likelihood of the match given the relation, given by equation 0 (account for occlusion 
is made below). The third term, $p(\theta)$, is the prior on the relation, assumed uniform here, 
but this can be altered to include any appropriate prior knowledge. 

Thus the decomposition above is useful in two ways: (1) it yields $p(\theta \mid D_l)$ without 
having to commit to a set of matches and (2) the likelihood $p(D_l \mid \theta, \delta_i)$ takes account 
of the different hypothesised image deformations. 

Occlusion To take account of occlusion, the disparity $\delta_i$ for a given match can take 
a null value, representing the fact that no match can be found, with a finite probability, that 
is $p(\delta_i = \emptyset) = p_1$. For this value of $\delta_i$, the conditional probability of the image patch 
correlation $p(D_l \mid \theta, \delta_i)$ is also set to a constant value $p_2$. The resulting estimate of $\theta$ 
remains constant over a large range of $p_1$ and $p_2$. Smaller values of these constants tend to 
peak the distribution, while larger values flatten it. 
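
The marginalisation over matches, including the null (occluded) hypothesis, can be illustrated with a small sketch. It is only a toy under stated assumptions: the function name, the list-of-lists layout of the likelihood tables and the numerical values are hypothetical, and a uniform prior on the relation is assumed so that it can be dropped as a constant.

```python
import math

def log_relation_score(patch_lik, match_lik, p_null, p_null_patch):
    """Log of the product over features of the marginalised match likelihood.

    patch_lik[i][j] stands for p(D | theta, delta_i = j) and
    match_lik[i][j] for p(delta_i = j | theta). A uniform prior on theta
    is assumed (hypothetical simplification), so it is left out.
    """
    total = 0.0
    for patch_row, match_row in zip(patch_lik, match_lik):
        # The null (occluded) hypothesis contributes a constant term.
        marginal = p_null * p_null_patch
        # Sum over every candidate feature j of the second image.
        for patch_value, match_value in zip(patch_row, match_row):
            marginal += patch_value * match_value
        total += math.log(marginal)
    return total

# Two features in the first image, two candidates each (made-up numbers).
patch_lik = [[0.5, 0.1], [0.2, 0.2]]
match_lik = [[0.5, 0.4], [0.45, 0.45]]
score = log_relation_score(patch_lik, match_lik, p_null=0.1, p_null_patch=0.1)
```

Working in the log domain keeps the product over many features from underflowing; no set of matches has to be fixed before a candidate relation is scored.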



6 Feature Matching Algorithm Using IMPSAC 



The algorithm is summarized in Table 2. The first stage is to generate the features at all 
levels. Then, at the coarsest scale, cross correlation scores are generated between all 
features, with each patch undergoing 16 evenly spaced rotations (this is only necessary 
if image deformation is expected). Random sampling of minimal match sets is used to 
generate an initial set of putative solutions, each match being picked in proportion to its 
correlation likelihood. 

After the coarsest level $l = 0$, two options are considered for generation of the subsequent 
importance sampling functions, both valid. The first method (importance sampling) 
is to use the mixture of Gaussians method described above. This has the advantage that 
new particles are generated across the whole parameter space, and the disadvantage that it is 
slow to compute. The second method (importance resampling) represents $g_l(\theta)$, $l > 0$, 
using the set of particles $S^{l-1}$, each assigned probability $p(\theta_i) = \pi_i$, where 
$\pi_i = w(\theta_i) / \sum_j w(\theta_j)$ and $w(\theta_i) = p(\theta_i \mid D_{l-1}) / g_{l-1}(\theta_i)$. 
A problem with the resampling approach is that one particle 
$\theta_{\max}$ may come to represent all the probability mass at a given level and hence all 
the particles at the finer level will be replicas of it. One solution to this problem in a 
different setting is justified by Sullivan and Blake [21], in which a small amount of noise 
(compensated for by subtracting it from the prior $p(\theta)$) is added to each particle as it 
is transmitted to the next level. This can be intuitively explained in this case by the fact 
that the resolution of the match-coordinates changes as the image is subsampled (here 
by a factor of 2). For instance, if the features are not represented to sub-pixel accuracy, 
then change of scale introduces some uncertainty into where the features should lie at 
the next scale of the order 0-1 pixel. Each particle was estimated from a minimal set 
of feature matches. Thus, to add uncertainty to $\theta$, noise from 0-1 pixel is added to the 
minimal set used to estimate it. In this case, each particle represents a distribution over 
$\theta$-space, determined by the level of uncertainty in the coordinates. 
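
The resampling step with added noise can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name, the representation of a particle as the list of image coordinates of its minimal match set, and the use of uniform noise in [-jitter, jitter] are all assumptions made for the example.

```python
import random

def resample_with_jitter(particles, weights, jitter, rng):
    """Draw len(particles) particles in proportion to their weights and
    perturb the minimal-set coordinates of every draw.

    Each particle is a list of (x, y) coordinates from its minimal match
    set (hypothetical representation); jitter is in pixels.
    """
    total = sum(weights)
    probabilities = [w / total for w in weights]
    # Sampling with replacement: heavy particles are drawn several times.
    drawn = rng.choices(particles, weights=probabilities, k=len(particles))
    jittered = []
    for minimal_set in drawn:
        # Independent noise per coordinate keeps replicated particles distinct.
        jittered.append([
            (x + rng.uniform(-jitter, jitter), y + rng.uniform(-jitter, jitter))
            for (x, y) in minimal_set
        ])
    return jittered

rng = random.Random(0)
particles = [[(0.0, 0.0)], [(10.0, 10.0)]]
new_particles = resample_with_jitter(particles, [0.0, 1.0], jitter=1.0, rng=rng)
```

In the usage lines all of the weight sits on the second particle, which is the degenerate case described above; the jitter is what stops the next level from consisting of exact replicas.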

Table 2. Feature Matching Algorithm using IMPSAC. 

1. At each scale: Detect features. 

2. Putative matching of corners over the coarsest two images using proximity and cross 
correlation under a variety of rotations. 

3. At the coarsest level: Generate a set of particles $S^0 = \{\theta_m^0\}$ and weights $w_m^0$, 
$m = 1 \ldots M$, as follows: 

a) Select a random sample without replacement of the minimum number of correspondences 
required to estimate the relation $\mathcal{R}$. 
b) Calculate $\theta_m^0$ from this minimal set. 
c) Calculate $w_m^0 = p(\theta_m^0 \mid D_0)$ for each sample. 

4. For $l = 1$ to $l$ = finest level: 

a) Generate an importance sampling function $g_l(\theta)$ from $S^{l-1}$. 
b) Generate $M$ draws from $g_l$ to generate $S^l$. 
c) For each $\theta_i^l$, calculate $w_i = p(\theta_i^l \mid D_l) / g_l(\theta_i^l)$. 

5. The particle with the maximum posterior probability is taken as the MAP estimate. This 
can then be used as a starting point for a gradient descent algorithm. 



7 Results 

The final stage of the algorithm in Table 2 is to select the most likely particle at the 
finest level as the most likely hypothesis. This is the particle $\theta_{i\max}$ which maximises 
$p(\theta_i \mid D)$. The feature $i$ in the first image is matched to the feature $j$ in the second 
image which maximises $p(\delta_i = j \mid \theta_{i\max})$. Figure 3 shows the successful matching of 
two images with up to 160 pixels disparity, demonstrating the capacity of IMPSAC for 
wide baseline matching. Figure 4 shows how IMPSAC is robust to large rotations of 
the image. In Figure 5, mismatches of MLESAC are corrected by rematching with the 
augmented likelihood, more than doubling the number of matched features. 



8 Future Work 



Due to space constraints, model selection is not the topic of this paper. However, it will 
be briefly illustrated how importance sampling can be used to evaluate the marginal 
likelihoods required for model comparison. Given a set of $k$ models $M_1 \ldots M_k$ that can 
explain the data $D$ (here the models are fundamental matrix, homography, augmented 
fundamental matrix etc.), Bayes' rule leads us to 

$$p(M_j \mid D I) = \frac{p(D \mid M_j I)\, p(M_j \mid I)}{p(D \mid I)} \quad (10)$$
Fig. 3. Wide Baseline Success of IMPSAC: the first and last images from the Samsung sequence, 
captured at the same time but from different positions. The disparity between the images is up to 
160 pixels, yet only 3 or 4 of the 50 example matches shown are mismatched. 




Fig. 4. Rotation Success of IMPSAC: Despite the combination of a rotation of 90 degrees and 
the change in pose of the face, the features are correctly matched. Although just 40 features are 
shown for clarity, over 1000 were matched. 
Fig. 5. MLESAC Mismatches Corrected by Augmented Likelihood: (Above) MLESAC matches 
with affine fundamental matrix include numerous mismatches. (Below) From the same MLESAC 
hypothesis, rematching with augmented likelihood increases the number of matches from 509 to 1274, 
also reducing mismatches. 



where $I$ is the prior information assumed about the world. Note $p(D \mid I)$ is the same for 
all models. Assuming that all the models are equally likely a priori, i.e. $p(M_i \mid I) = p(M_j \mid I)$ 
for all $i, j$, the key to the posterior likelihood of each model is the evaluation of $p(D \mid M_j I)$, which is called 
the evidence. This is the integral of the likelihood over all possible values of the model’s 
parameters: 

$$p(D \mid M_j I) = \int p(D \mid M_j \theta I)\, p(\theta \mid M_j I)\, d\theta \quad (11)$$

where $\theta$ are the $j$th model’s parameters, and $p(\theta \mid M_j I)$ is the prior distribution of 
parameters of the model. One method for numerically evaluating this integral would 
be to uniformly sample the parameter space and sum the posteriors of the samples. 
Unfortunately the high dimensionality of the parameter space precludes this. One could 
draw samples from the prior and sum the posterior of these samples, but typically the prior 
is too diffuse to yield samples around the peak of the distribution. Importance sampling 
furnishes a Monte Carlo method for performing this integration 0, the advantage of 
which is that samples can be taken more densely around the expected peak of the posterior 
and less densely in areas of little interest. If the importance sampling function is $g(\theta)$ 
($g(\theta)$ is a normalized density), then given a set of $M$ particles drawn from $g(\theta)$ 



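$$p(D \mid M_j I) \approx \frac{1}{M} \sum_{i=1}^{M} \frac{p(D \mid M_j \theta_i I)\, p(\theta_i \mid M_j I)}{g(\theta_i)} \quad \text{as } M \to \infty \quad (12)$$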
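
As a rough check of the estimator in Eq. (12), the sketch below applies it to a one-dimensional toy model where the evidence is known in closed form. Everything in it is an assumption made for illustration (a single datum with unit-variance Gaussian likelihood, a Gaussian prior with standard deviation 10, and a Gaussian importance function centred on the datum); it is not the model comparison performed in the paper.

```python
import math
import random

def normal_pdf(x, mean, sd):
    """Density of a univariate Gaussian."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))

def evidence_by_importance_sampling(likelihood, prior, g_pdf, g_draw, num_particles):
    """Average of likelihood * prior / g over particles drawn from g."""
    total = 0.0
    for _ in range(num_particles):
        theta = g_draw()
        total += likelihood(theta) * prior(theta) / g_pdf(theta)
    return total / num_particles

rng = random.Random(0)
datum = 2.0  # hypothetical observation
estimate = evidence_by_importance_sampling(
    likelihood=lambda t: normal_pdf(datum, t, 1.0),
    prior=lambda t: normal_pdf(t, 0.0, 10.0),
    g_pdf=lambda t: normal_pdf(t, datum, 1.5),
    g_draw=lambda: rng.gauss(datum, 1.5),
    num_particles=20000,
)
# For this toy model the evidence is Gaussian with variance 1 + 10^2.
exact = normal_pdf(datum, 0.0, math.sqrt(101.0))
```

The importance function is deliberately a little wider than the posterior, which keeps the ratio of posterior to $g$ bounded and the variance of the estimate small.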



Evaluation of this leads to the selection of an augmented fundamental matrix model for 
the Samsung sequence shown in Figure 3, a homography model for the Zhang sequence 
shown in Figure 4, and an augmented affine fundamental matrix for Figure 5. 



9 Conclusion 

Within this paper, coarse to fine estimation of structure and motion has been demonstrated. 
This has been achieved through the synthesis of powerful statistical techniques. The 
concept of using a random sampling estimator to generate the importance sampling function, 
IMPSAC, is a general mechanism that can be used in a wide variety of statistical 
problems beyond this. It provides a solution to the general problem of how to create 
importance sampling functions for outlier corrupted data. The coarse to fine strategy helps 
overcome the wide baseline problem, and this combined with the plane plus parallax 
representation (the augmented fundamental matrix) overcomes the image deformation 
problem. The result is a general purpose and powerful image matching algorithm that 
can be used for 3D reconstruction or compression. Finally, it has been explained how 
importance sampling can also be used for automatic model selection. 
Abstract. This paper considers a fundamental problem in visual motion 
perception, namely the problem of egomotion estimation based on visual 
input. Many of the existing techniques for solving this problem rely on 
restrictive assumptions regarding the observer’s motion or even the scene 
structure. Moreover, they often resort to searching the high dimensional 
space of possible solutions, a strategy which might be inefficient in terms 
of computational complexity and exhibit convergence problems if the se- 
arch is initiated far away from the correct solution. In this work, a novel 
linear constraint that involves quantities that depend on the egomotion 
parameters is developed. The constraint is defined in terms of the optical 
flow vectors pertaining to four collinear image points and is applicable re- 
gardless of the egomotion or the scene structure. In addition, it is exact in 
the sense that no approximations are made for deriving it. Combined with 
robust linear regression techniques, the constraint enables the recovery of 
the FOE, thereby decoupling the 3D motion parameters. Extensive simu- 
lations as well as experiments with real optical flow fields provide evidence 
regarding the performance of the proposed method under varying noise 
levels and camera motions. 



1 Introduction 

Knowledge of the velocity of a mobile system with respect to its environment is 
essential for various servoing tasks that are based on visual feedback, e.g. collision 
avoidance, docking, image stabilization, etc. Given a sequence of images acqui- 
red by a monocular observer pursuing unrestricted rigid motion, the problem of 
egomotion estimation can be defined as the problem of recovering the linear and 
angular velocities comprising the motion of the observer. Although simply stated, 
the problem of estimating egomotion using visual input is particularly difficult. 
This difficulty primarily stems from the fact that the only information available 
from images is related to the observed 2D motion of image points, which depends 
both on the sought egomotion and the unknown 3D structure of the viewed scene. 

* This work has been carried out while the author was with the Computer Science 
Dept, Univ. of Crete and the Inst, of Computer Science, FORTH, Heraklion, Crete, 
Greece. Funding was partially supplied by the VIRGO research network of the TMR 
Since the dependence of the 2D image motion on the scene structure is nonlinear, 
small errors in the estimates of 2D motion can have a significant impact on the 
accuracy of the recovered 3D motion ^ . In addition, the confounding of transla- 
tion and rotation makes the problem of estimating unrestricted egomotion much 
harder compared to the problem of estimating pure translation or rotation 
Due to its importance, many algorithms dealing with the problem of estimating 
egomotion have appeared in the literature. The following paragraphs provide a 
short review of a few representative methods; more detailed discussions can be 
found in mm- Most of the methods reviewed here rely on the availability of a 
dense optical flow field to describe 2D motion. Prazdny m, for example, assu- 
mes that surfaces in the viewed scene are smooth and recovers rotation through 
numerical optimization techniques using a set of nonlinear equations that are 
independent of translation. Prazdny m and later Burger and Bhanu 0 also 
suggested solving for rotation first and employed a search in the space of rota- 
tional parameters. For each hypothesized rotation, the corresponding rotational 
field was subtracted from the optical flow and the remaining field was tested 
for conformance to a purely translational flow field. Bruss and Horn [Q combine 
information from the whole visual field to determine the 3D motion that is the 
best least squares fit to the observed velocity field. They developed three different 
algorithms, the first two of which give closed form solutions for translation and 
rotation when the motion is purely translational or rotational respectively. The 
third algorithm applies to the case of general motion and estimates translation 
by minimizing an appropriate residual function using iterative numerical proce- 
dures. Reiger and Lawton m solve for translation by exploiting the phenomenon 
of motion parallax. By subtracting the optical flow vectors at two image locations 
whose corresponding 3D points have sufficiently different depths, a flow vector 
that is approximately pointing towards the FOE¹ is obtained. The main drawback 
of this approach stems from the fact that most optical flow algorithms cannot 
give accurate estimates of optical flow in areas with large depth variations. Re- 
cently, Irani et al j0| alleviated some of the difficulties related to the estimation 
of motion parallax by decomposing image motion into the sum of the motion of 
a planar surface and a residual planar parallax field that is purely translational. 

Heeger and Jepson 0 also make use of the residual function introduced in P 
and propose an efficient search technique for locating its minimum. Hummel and 
Sundareswaran P present an algorithm for finding the rotational motion and one 
for locating the FOE. The first algorithm is based on the observation that the 
curl of the optical flow field is approximately a linear function whose coefficients 
are proportional to the desired rotational parameters of motion. The algorithm 
for locating the FOE extends the work of Heeger and Jepson [7] by considering for 
each candidate FOE the projection of the optical flow along vectors emanating 
from the former. Da Vitoria Lobo and Tsotsos m develop a constraint (the 
Collinear Point Constraint - CPC) involving flow projections at three collinear 
image points, which provides a means for canceling rotation and at the same time 



¹ The FOE gives the direction of translation and is defined more rigorously in the 
following. 
constraining the FOE to lie on the line defined by the collinear points. The CPC 
is discussed in more detail in Section 3.1. Optical flow projections are also used 
in ^21 and the FOE is recovered through their pairwise differences. Daniilidis 0 
employs fixation on a scene point to reduce the number of motion parameters to 
be estimated from five to four. The associated spherical motion field is projected 
on two latitudinal directions and the motion parameters are then found by two 
one-dimensional searches along meridians of the image sphere. 

In this paper, it is assumed that either the viewed scene is static or the 
independently moving objects have been identified and masked out [14]. The 
motivation behind our egomotion estimation method is twofold. First, we are 
interested in estimating egomotion by means of linear constraints. Second, we 
want to avoid making any restrictive assumptions regarding the egomotion or the 
scene structure. Hence, we have developed a novel linear constraint regarding the 
motion parameters, defined in terms of four collinear image points. The constraint 
is applicable regardless of the egomotion or the scene structure and combined 
with robust linear regression techniques, permits the recovery of the direction of 
translation, thereby decoupling the 3D motion parameters. The rest of this paper 
is organized as follows. Section 2 presents an overview of some preliminary results 
that are essential for the development of the proposed method. Section 3 develops 
the proposed constraint and shows how it can be employed to recover egomotion. 
Experimental results from an implementation of the method are presented in 
Section 4. The paper is concluded with a brief discussion in Section 5. A more 
detailed version can be found in ca- 

2 Visual Motion Representation 

Before proceeding with the description of the proposed method, issues related 
to motion representation are discussed. Consider a coordinate system OXY Z 
positioned at the optical center (nodal point) of a pinhole camera, such that 
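the OZ axis coincides with the optical axis. Suppose that the camera is moving 
rigidly with respect to its 3D static environment with translational motion 
$(U, V, W)$ and rotational motion $(\alpha, \beta, \gamma)$. Under perspective projection, the 3D 
point $P(X, Y, Z)$ projects to image point $p(x, y)$ which moves on the image plane 
with velocity $(u, v)$, given by mi: 

$$u = \frac{-Uf + xW}{Z} + \alpha\frac{xy}{f} - \beta\left(\frac{x^2}{f} + f\right) + \gamma y, \qquad v = \frac{-Vf + yW}{Z} + \alpha\left(\frac{y^2}{f} + f\right) - \beta\frac{xy}{f} - \gamma x, \quad (1)$$

where $f$ is the focal length. 
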
Equations (1) describe the optical flow field, which relates the 3D motion of points 
to their projected 2D motion on the image plane. The problem of estimating the 
optical flow from an image sequence is fundamental to motion analysis. However, 
due to space limitations, it will not be discussed further here. An excellent introduction 
to the problem as well as a review of the state of the art can be found 
in CHI. Several observations regarding Eqs. (1) can be made. First, the effect 
of translation on the observed 2D motion is independent from that of rotation, 
i.e. the translational and rotational components of motion are separable. Second, 
the rotational component of motion is independent of scene structure, since the 
depth $Z$ influences the translational component only. Third, the vectors defined 
by the translational components of the motion field lie on lines going through 
the point $(x_0, y_0) = (Uf/W, Vf/W)$, which is known as the Focus Of Expansion 
(FOE). The FOE defines the direction of the translational motion, and is of central 
importance for several motion analysis problems. Finally, if the translation 
$(U, V, W)$ and the depth $Z$ are multiplied by the same scale factor, the flow defined by Eqs. (1) 
remains the same. In other words, there exists a scale ambiguity that prevents us 
from differentiating between a close object moving slowly and a distant one that 
is moving fast. Thus, the information related to the translational component of 
egomotion that can be recovered from Eqs. (1) is at most its direction, i.e. the 
FOE. The ratio $Z/W$ is often referred to as the time-to-contact m 
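
Two of these observations, the scale ambiguity and the fact that translational flow lies on lines through the FOE, can be checked numerically. The sketch below assumes the standard perspective motion field with focal length f and the sign conventions implied by the FOE definition above; the function name and the numerical values are hypothetical.

```python
def optical_flow(x, y, Z, translation, rotation, f):
    """Motion field at image point (x, y) with depth Z.

    Assumes the standard perspective flow model; translation is (U, V, W)
    and rotation is (alpha, beta, gamma).
    """
    U, V, W = translation
    alpha, beta, gamma = rotation
    u = (-U * f + x * W) / Z + alpha * x * y / f - beta * (x * x / f + f) + gamma * y
    v = (-V * f + y * W) / Z + alpha * (y * y / f + f) - beta * x * y / f - gamma * x
    return u, v

f = 256.0
translation = (-120.0, 100.0, 150.0)
rotation = (0.005, 0.004, 0.002)
x, y, Z = 50.0, -30.0, 15000.0

# Scale ambiguity: scaling the translation and the depth together
# leaves the flow unchanged.
flow = optical_flow(x, y, Z, translation, rotation, f)
scaled = optical_flow(x, y, 3.0 * Z, tuple(3.0 * t for t in translation), rotation, f)

# Pure translation: the flow vector is parallel to (x - x0, y - y0).
x0 = translation[0] * f / translation[2]
y0 = translation[1] * f / translation[2]
ut, vt = optical_flow(x, y, Z, translation, (0.0, 0.0, 0.0), f)
cross = ut * (y - y0) - vt * (x - x0)
```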

3 Using Quadruples of Collinear Points to Constrain the 
FOE 

In the following, it is assumed that the camera has been intrinsically calibrated, 
so that the retinal transformations among pixel and image coordinate systems 
are known H51. Before proceeding to the description of the proposed method, we 
state two theorems which are essential for its derivation. The proofs, which are 
omitted due to space limitations, can be found in H2|. 



3.1 Two Precursory Theorems 

Theorem 1 Suppose that two image points $p_1 = (x_1, y_1)$ and $p_2 = (x_2, y_2)$ lie 
on a line that goes through the origin of the image coordinate system (i.e. the 
principal point). The difference of the projections of their corresponding optical 
flow vectors along the direction $n = (n_x, n_y)$ that is normal to the line is equal to 

$$u_{n_1} - u_{n_2} = D\,W\left(\frac{1}{Z_1} - \frac{1}{Z_2}\right) + \frac{\gamma}{n_y}(x_2 - x_1), \quad (2)$$

where $u_{n_i} = u_i n_x + v_i n_y$, $i = 1, 2$ and $D = (x_1 - x_0)n_x + (y_1 - y_0)n_y$. 
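
Theorem 1 can be verified numerically on a synthetic pair of points. The sketch below is only a check under stated assumptions: it uses the standard perspective motion field with focal length f, takes n to be a unit normal, and the point positions, depths and motion values are made up for the example.

```python
import math

def optical_flow(x, y, Z, translation, rotation, f):
    """Standard perspective motion field (assumed model)."""
    U, V, W = translation
    alpha, beta, gamma = rotation
    u = (-U * f + x * W) / Z + alpha * x * y / f - beta * (x * x / f + f) + gamma * y
    v = (-V * f + y * W) / Z + alpha * (y * y / f + f) - beta * x * y / f - gamma * x
    return u, v

f = 256.0
translation = (-120.0, 100.0, 150.0)
rotation = (0.005, 0.004, 0.002)
U, V, W = translation

# A line through the principal point with direction angle 0.6 rad,
# and its unit normal n.
angle = 0.6
direction = (math.cos(angle), math.sin(angle))
nx, ny = -math.sin(angle), math.cos(angle)

# Two points on the line, at signed distances 100 and -60 from the origin.
x1, y1, Z1 = 100.0 * direction[0], 100.0 * direction[1], 12000.0
x2, y2, Z2 = -60.0 * direction[0], -60.0 * direction[1], 30000.0

u1, v1 = optical_flow(x1, y1, Z1, translation, rotation, f)
u2, v2 = optical_flow(x2, y2, Z2, translation, rotation, f)
lhs = (u1 * nx + v1 * ny) - (u2 * nx + v2 * ny)

# Right-hand side of Eq. (2).
x0, y0 = U * f / W, V * f / W
D = (x1 - x0) * nx + (y1 - y0) * ny
rhs = D * W * (1.0 / Z1 - 1.0 / Z2) + rotation[2] / ny * (x2 - x1)
```

Only the rotation about the optical axis survives in the difference, which is why the theorem involves $\gamma$ but neither $\alpha$ nor $\beta$.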



Theorem 2 Let $p_1 = (x_1, y_1)$, $p_2 = (x_2, y_2)$ and $p_3 = (x_3, y_3)$ be three collinear 
image points lying on a line whose equation is $y = kx + v$. Let also $(x_0, y_0)$ be 
the FOE and assume that $p_2$ divides the line segment $p_1 p_3$ in ratio $\lambda$. For the 
projections $u_{n_i}$, $i = 1 \ldots 3$ of the optical flow vectors at points $p_1$, $p_2$ and $p_3$ 
along an arbitrary direction $(n_x, n_y)$, the following holds 



UUr, — 



1 + A 



rin, — 



1 + A 
Z2 \ X Z\ 1 + A Z^ 
In the above equation, $D_2 = (x_2 - x_0)n_x + (y_2 - y_0)n_y$ and 
$d_{21} = (x_2 - x_1)n_x + (y_2 - y_1)n_y$. 

By inspecting Eq. (3), it can easily be seen that in the case that the direction 
of projection $(n_x, n_y)$ is perpendicular to the line defined by the points $p_i$, the 
term $d_{21}$ is zero, thus the sum of the rotational components vanishes. The remaining 
terms are identical to the expression for the Collinear Point Constraint 
(CPC) that was derived by Da Vitoria Lobo and Tsotsos in [10]. The CPC states 
that when an appropriate linear combination of the projections of optical flow 
vectors in the direction perpendicular to the line joining them is zero, there exist 
two possible situations. Either the three 3D points whose projections form the 
collinear triplet are also collinear in the scene (i.e. $\frac{1}{Z_2} - \frac{1}{1+\lambda}\frac{1}{Z_1} - \frac{\lambda}{1+\lambda}\frac{1}{Z_3} = 0$), 
or the line defined by the collinear triplet passes through the FOE (i.e. $D_2 = 0$). 
By employing a voting scheme to differentiate between these two cases, the CPC 
has been combined in [10] with exhaustive image based search for locating the 
FOE. 



3.2 The Proposed Constraint on Egomotion 

Assume now a mobile observer undergoing rigid motion in a static environment. 
Let $p_1 = (x_1, y_1)$, $p_2 = (x_2, y_2)$ and $p_3 = (x_3, y_3)$ be three collinear image points 
lying on a line $\mathcal{L}$ through the image principal point. Let also $(n_x, n_y)$ be the 
direction normal to $\mathcal{L}$ and $(n'_x, n'_y)$ and $(n''_x, n''_y)$ two other directions that are not 
perpendicular to $\mathcal{L}$. According to Theorem 2, for the projections of the optical 
flow vectors along the direction $(n'_x, n'_y)$ the following holds 

' 1 ^ A , 1 11 A 1 , 

un^ un, un^ = Ur,W{ ) -I- (4 

^ 1-kA ^ 1-kA 3 2 ^^2 1 + AZi l + AZa^ ^ ^ 



where the primed terms are defined analogously to the unprimed ones in Eq. (3). 
Similarly, for the projections along the normal direction $(n_x, n_y)$, Eq. (3) gives 

$$u_{n_2} - \frac{1}{1+\lambda}u_{n_1} - \frac{\lambda}{1+\lambda}u_{n_3} = D_2 W\left(\frac{1}{Z_2} - \frac{1}{1+\lambda}\frac{1}{Z_1} - \frac{\lambda}{1+\lambda}\frac{1}{Z_3}\right). \quad (5)$$

Dividing Eq. (4) with Eq. (5) yields 

Applying Eq. (2) for points $p_1$ and $p_3$ results in $u_{n_1} - u_{n_3} = D_2 W\left(\frac{1}{Z_1} - \frac{1}{Z_3}\right) + \frac{\gamma}{n_y}(x_3 - x_1)$. 
Solving this equation for dividing in terms by Eq. 0 and 
substituting the result into Eq. ® yields 
Let now $p_4 = (x_4, y_4)$ be a fourth point collinear with the triplet $p_1$, $p_2$ and $p_3$ 
and such that point $p_2$ divides the segment $p_1 p_4$ in ratio $\mu$. Eq. (7) gives for 
the projections along the direction $(n''_x, n''_y)$ 



un'2 - i^un'i - Y^un'i 
“”2 - d'21 



fi . n 
Subtracting Eq. (8) from Eq. (7) and noting that = X2 — X3 and 

X2 — Xi, results in 
The term /'^2i (gj jg independent of the FOE and can be computed 
using the point retinal coordinates only. Indeed, it can be shown that 

- -P2/4 'i ^ (n"n^' - n^n'y)ny 

D 2 {n^^n'y - n^ny){n,^ny - n'^ny){x2 - xi) 

Equation is independent of the scene depths and linear in the two unknowns 
and -jy ^ — h na — P, therefore forms the basis for the development of the pro- 
posed egomotion estimation method: Given a line C through the image principal 
point, Eq. @ is employed for estimating the term corresponding to C. In 
theory, two quadruples of image points lying on C suffice to provide estimates 
of the unknown parameters and -jy ^ — h na — p. However, to enhance noise 
immunity, multiple quadruples of points on C are selected at random and robust 
estimates of the two unknowns are computed using the LMedS robust estima- 
tor m- Knowledge of the term Df foi ^ ^ provides one constraint on the 

location of the FOE, namely 



jC \ -C £i £i \ C. C. T~\C. 

xon,^ + yony = X + y Uy - , 
where (xo,yo) is the sought FOE, {n^,riy) is the unit normal for line C and 
is a point on £. Noting that each line C through the image principal 
point supplies one constraint of the form of Eq. ill ill regarding the FOE, the 
constraints arising from multiple such lines can be combined to yield the FOE. 
More specifically, using many lines through the image principal point, robust 

estimates of the corresponding distances are obtained as previously outlined. 

^2 __ 

For each of the obtained distance estimates, Eq. (HU gives rise to a linear con- 
straint regarding the FOE. The LMedS estimator is then applied once again on 
these constraints to give a robust estimate of the FOE. If required, estimates of 
the rotational velocity can be obtained in a similar manner by employing robust 
regression for (a, /3, 7 ) on the constraints derived from the terms — h Ka — /3 

computed for each line through the image principal point. Alternatively, rotation 
can be estimated using optical flow projections along directions that are normal 
to lines through the estimated FOE and therefore are independent of translation. 
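
The final step, combining one linear constraint per line into a robust FOE estimate, can be sketched with a basic least median of squares (LMedS) search. This is an illustrative sketch under assumptions, not the implementation used in the paper: each constraint is taken to be a hypothetical triple (n_x, n_y, c) meaning x0*n_x + y0*n_y = c, candidate solutions come from random pairs of constraints, and the function name is made up.

```python
import math
import random

def lmeds_point(constraints, num_trials, rng):
    """Least median of squares estimate of (x0, y0).

    constraints is a list of (nx, ny, c) triples, each meaning
    x0 * nx + y0 * ny = c (assumed generic form of the line constraints).
    """
    best, best_median = None, float("inf")
    for _ in range(num_trials):
        # A minimal sample: two constraints determine a candidate point.
        (a1, b1, c1), (a2, b2, c2) = rng.sample(constraints, 2)
        det = a1 * b2 - a2 * b1
        if abs(det) < 1e-9:
            continue  # nearly parallel normals, skip this sample
        x0 = (c1 * b2 - c2 * b1) / det
        y0 = (a1 * c2 - a2 * c1) / det
        # Score the candidate by the median squared residual.
        residuals = sorted((a * x0 + b * y0 - c) ** 2 for a, b, c in constraints)
        median = residuals[len(residuals) // 2]
        if median < best_median:
            best, best_median = (x0, y0), median
    return best

rng = random.Random(1)
true_foe = (30.0, -20.0)
constraints = []
for k in range(40):
    angle = math.pi * k / 40
    nx, ny = math.cos(angle), math.sin(angle)
    c = true_foe[0] * nx + true_foe[1] * ny
    if k % 4 == 0:
        c += 50.0  # a gross outlier on every fourth line
    constraints.append((nx, ny, c))
estimate = lmeds_point(constraints, num_trials=200, rng=rng)
```

Because the score is a median rather than a sum, up to half of the constraints can be grossly wrong without moving the estimate, which is the property the method relies on at both stages.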

4 Experimental Results 

The proposed method has been extensively tested with the aid of simulated and 
real flow fields. Representative results from these experiments are given in this 
section. In all the experiments reported here, at most 180 lines through the image 
principal point and 400 quadruples of points along each line have been employed. 



4.1 Synthetic Flow Fields 

The use of simulated data is justified by the fact that knowledge of the ground 
truth facilitates a quantitative assessment of the accuracy of the results. Besides, 
simulation enables us to vary in a controlled manner subsets of the parameters 
involved in the problem of egomotion estimation and then study their effect on 
the recovered motion. Therefore, a simulator has been constructed, which given 
appropriate values for the intrinsic parameters of the simulated camera (focal 
length and principal point), the translational and rotational motion parameters, 
the dimensions of the retina and the depth corresponding to each image point, 
employs Eqs. (1) to synthesize an optical flow field. The depths of image points 
are generated by random variables following various distributions. For the experiments 
reported here, a uniform distribution in the range $[Z_{min}, Z_{max}]$ and a 
Gaussian distribution with nonzero mean have been employed. All distances and 
sizes used by the simulator are specified in units of pixels. To account for the fact 
that optical flow fields might be sparse, their density, i.e. a percentage specifying 
the fraction of image points for which optical flow vectors have been computed, 
can be supplied. To make the simulated optical flow fields more realistic, noise is 
added to the synthetic optical flows. The noise we employ is generated according 
to the model suggested in m-- 

$$u_{noisy} = u + sign_1 \cdot N(a, b) \cdot u, \qquad v_{noisy} = v + sign_2 \cdot N(a, b) \cdot v,$$

where $sign_1$ and $sign_2$ are binary values (i.e. 1 or -1) that are randomly chosen 
with equal probability and $N(a, b)$ is a Gaussian random variable with mean $a$ 
and standard deviation $b$. This noise model is referred to as “Gaussian noise with 
mean $a$% and $\sigma = b$%”. As noted in CO], 8% and 2% are realistic values for the 
noise mean and the standard deviation respectively, accounting for most of the 
errors observed in actual flow fields. 
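
A minimal sketch of this noise model is given below. Two points are assumptions rather than statements from the text: the mean and standard deviation are passed as percentages and divided by 100 before being applied, and the two flow components receive independent Gaussian draws. The function name is hypothetical.

```python
import random

def add_flow_noise(u, v, mean_pct, sd_pct, rng):
    """Perturb a flow vector with sign-randomised relative Gaussian noise.

    mean_pct and sd_pct are percentages (assumed interpretation), so a
    mean of 8 changes each component by about 8% of its magnitude.
    """
    sign1 = rng.choice((-1, 1))
    sign2 = rng.choice((-1, 1))
    noisy_u = u + sign1 * rng.gauss(mean_pct, sd_pct) / 100.0 * u
    noisy_v = v + sign2 * rng.gauss(mean_pct, sd_pct) / 100.0 * v
    return noisy_u, noisy_v

rng = random.Random(0)
# With zero spread, each component is off by exactly 8% of its value.
noisy_u, noisy_v = add_flow_noise(2.0, -1.5, mean_pct=8.0, sd_pct=0.0, rng=rng)
```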

Throughout all experiments, image size was 512 x 512 pixels and the principal 
point was assumed to be in the center of the image. Also, in all but the third set 
of experiments, the focal length was 256 pixels, amounting to a field of view of 90 
degrees. The density of the optical flow fields was 70%. Two different scenarios 
for the scene depth were simulated. The first uses a random variable that is 
uniformly distributed in the range [10000, 50000] pixels to model the depth of 
a scene with large depth variations. The second scenario employs a Gaussian 
distribution with mean 15000 pixels and standard deviation 3000, to emulate 
a scene with less depth variation, in which the majority of the points lie at 
a dominant depth rather close to the camera. To ensure that the results are 
independent of the exact depth values used to synthesize the optical flow field, 
each experiment was run 100 times, each time using a different depth population 
drawn from the distributions described above. 

In the first set of experiments, the effect of noise on the accuracy of the 
estimated FOE is examined by employing increasing noise levels. Figures 1(a) 
and (b) illustrate the mean and the standard deviation respectively of the FOE 
error for both depth distributions. Each point in the plots summarizes error 
statistics computed from 100 runs. If $f$ is the focal length and the true FOE 
is at $(x_0, y_0)$ while the estimated is at $(\hat{x}_0, \hat{y}_0)$, the error in the FOE estimate 
is defined as the angle between the vectors $(x_0, y_0, f)$ and $(\hat{x}_0, \hat{y}_0, f)$, given by 

$$\cos^{-1}\left(\frac{x_0 \hat{x}_0 + y_0 \hat{y}_0 + f^2}{\sqrt{x_0^2 + y_0^2 + f^2}\,\sqrt{\hat{x}_0^2 + \hat{y}_0^2 + f^2}}\right).$$

The 3D motion parameters used to synthesize flow 
were (U, V, W) = (-120, 100, 150) (measured in pixels per frame) and (α, β, γ) = 
(0.005, 0.004, 0.002) (measured in radians per frame). The egomotion parameters 
and the depth values are such that the magnitude of the average translational 
component of the flow fields is comparable to that of the average rotational 
component. The angle between the direction of translation and the optical axis 
is about 46 degrees. The noise mean was increased to 12% in steps of 1% and 
the standard deviation was kept equal to 2%. As expected, the error increases 
with noise but remains acceptable even with very large amounts of noise. The 
error in the case of Gaussian depths is smaller since in this case the translational 
component of motion is larger than that in the case of uniformly distributed 
depths; this is further explained in the discussion of the experiments related to 
the magnitude of translation below. 
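
The angular FOE error used in these experiments is the angle between two 3D vectors built from image coordinates and the focal length. The following Python sketch computes it; the function name `foe_angular_error` is hypothetical, and the FOE of a translation (U, V, W) is taken to be (fU/W, fV/W), which is the standard pinhole relation and is assumed here rather than quoted from the text.

```python
import math

def foe_angular_error(true_foe, est_foe, f):
    """Angle in degrees between (x0, y0, f) and (xe, ye, f)."""
    x0, y0 = true_foe
    xe, ye = est_foe
    dot = x0 * xe + y0 * ye + f * f
    norm_true = math.sqrt(x0 * x0 + y0 * y0 + f * f)
    norm_est = math.sqrt(xe * xe + ye * ye + f * f)
    # Clamp to [-1, 1] to guard against rounding before calling acos.
    cosine = max(-1.0, min(1.0, dot / (norm_true * norm_est)))
    return math.degrees(math.acos(cosine))

# Usage: angle between the translation (-120, 100, 150) and the optical axis
# for a focal length of 256 pixels.
f = 256.0
U, V, W = -120.0, 100.0, 150.0
foe = (f * U / W, f * V / W)
angle_to_axis = foe_angular_error(foe, (0.0, 0.0), f)
```

With these values the angle evaluates to roughly 46 degrees, in line with the figure quoted in the text.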

It has been observed in previous work on egomotion estimation that the error 
of the estimated FOE increases with the angle between the direction of translation 
and the direction of gaze (i.e. the direction defined by the optical axis) 0. The 
second set of experiments studies the dependence of the FOE error on this angle 
for the proposed method. Figures 2(a) and (b) show the mean and standard 
deviation of the FOE error with respect to the angle between the direction of 

Fig. 1. (a) Mean FOE error versus noise and (b) Standard deviation of FOE error 
versus noise. 



translation and the direction of gaze. The direction of translation was varied from 
(0, 0, f) to (f, 0, f), where f is the focal length. In other words, the translations 
considered range from a straight ahead motion to a sideways motion forming 
an angle of 45 degrees with the direction of gaze. The rotation parameters were 
again equal to (α, β, γ) = (0.005, 0.004, 0.002) and the magnitude of translation 
has been kept constant, equal to 216.565 pixels per frame, which is the magnitude 
of translation used in the first set of experiments. Each point in the graphs has 
been computed from 100 trials, performed with Gaussian noise of mean 8% and 
standard deviation of 2%. As can be seen from Fig. 2(a), the FOE error does not 
vary considerably when the angle between the direction of translation and the 
direction of gaze is increased. This is a desirable characteristic of the proposed 
method, since it implies that the observer does not need to fixate on the estimated 
FOE to ensure small errors in the FOE estimates. 








Fig. 2. (a) Mean FOE error versus the angle between the direction of translation and 
the direction of gaze and (b) Standard deviation of FOE error versus the angle between 
the direction of translation and the direction of gaze. 

The third set of experiments investigates the dependence of the FOE error on 
the field of view size. Figures 3(a) and (b) show the mean and standard deviation 
of the FOE error with respect to the size of the field of view. The field of view size 
was varied by adjusting the focal length while keeping the image size constant. 
More specifically, the former was decreased by a multiplicative factor of 0.5 from 
2048 to 64 pixels while the image size remained equal to 512 x 512 pixels. This 
change of the focal length amounts to the field of view being increased from 14.250 
to 151.927 degrees. Recall that a focal length of 256 pixels used in the previous 
experiments corresponds to a field of view equal to 90 degrees. The simulated 3D 
velocity was identical to that of the first set of experiments, i.e. translation was 
equal to (-120, 100, 150) and rotation to (0.005, 0.004, 0.002). Gaussian noise of 
mean 8% and standard deviation of 2% was added to the simulated flows and 
each point in the graphs was again computed from 100 trials. As can be seen 
from Fig. 3, the error in the recovered FOE is almost identical for both depth 
distributions. More specifically, the FOE error is very large for small fields of 
view but becomes acceptable when the latter are larger than 25 degrees. This 
observation agrees with the theoretical findings of [tlhj . which conclude that the 
inhomogeneous flow characteristics of a large field of view make it more helpful for 
determining the singularities of the flow field (i.e. the FOE and axis of rotation) 
compared to a narrow field of view. This conclusion holds independently of the 
particular algorithm that is employed to recover 3D motion. 
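
The relation between focal length and field of view used in this experiment can be checked with a short Python sketch. It assumes the usual pinhole relation, field of view = 2 atan((image size / 2) / f), with the principal point at the image center; the function name is hypothetical.

```python
import math

def field_of_view_deg(focal_length, image_size=512):
    """Full field of view in degrees for a centered principal point."""
    return math.degrees(2.0 * math.atan((image_size / 2.0) / focal_length))

# Focal lengths halved from 2048 down to 64 pixels, as in the experiment.
focal_lengths = [2048 * 0.5 ** i for i in range(6)]
fovs = [field_of_view_deg(f) for f in focal_lengths]
```

The first and last entries of `fovs` correspond to the 14.250 and 151.927 degree limits quoted above, and a focal length of 256 pixels gives 90 degrees.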

Fig. 3. (a) Mean FOE error versus the size of the field of view and (b) Standard 
deviation of FOE error versus the size of the field of view. 



The last set of experiments evaluates the performance of the method when the 
ratio between the magnitude of translation and that of rotation is varied. More 
specifically, assuming that the rotation is constant, Figures 4(a) and (b) depict 
the effect of variable translation magnitude on the mean and the standard deviation 
of the FOE error. In this series of experiments, the direction of translation 
is identical to that defined by (U, V, W) = (-120, 100, 150), but its magnitude is 
increased by a multiplicative factor of 1.5 between successive experiments. The 
rotation has been kept constant at (α, β, γ) = (0.005, 0.004, 0.002) and 100 runs 
were made for each set of motion parameters. The noise was Gaussian with mean 
8% and standard deviation 2%. As can be clearly seen from the plots, the FOE 
error is significant when the translation magnitude is small (less than 130 pixels 
per frame in Fig. 4(a)). This is due to the fact that in this case, the translational 
components of the optical flow vectors are negligible compared to the rotational 
ones. Therefore, noise has a more pronounced effect on the translational compo- 
nents from which the FOE is recovered. However, as the magnitude of translation 
increases beyond 130 pixels per frame, the translational parts become comparable 
or even larger than the rotational ones. Thus, the translational parts are more 
immune to noise, giving rise to small FOE errors which are almost constant with 
respect to the magnitude of translation. 

Fig. 4. (a) Mean FOE error versus magnitude of translation (b) Standard deviation of 
FOE error versus magnitude of translation. Note that the scale on the horizontal axes 
is logarithmic with base 1.5. 

Assuming constant translation, Figures 5(a) and (b) show the effects on the mean 
and the standard deviation of the 
FOE error induced by altering the rotation magnitude. Here, the behavior of the 
method is the converse of that observed in the case of constant rotation investigated 
in the previous paragraph. As can be seen from Fig. 5(a), the error in the 
FOE estimates is almost constant for realistic amounts of rotation (less than 0.5 
degrees per frame) . When the rotation increases too much, the flow field becomes 
mainly rotational, with the rotational components accounting for a large fraction 
of the full flow field. Thus, noise has an increased impact on the translational 
parts, resulting in large errors for the FOE estimates. During the experiments 
outlined in Fig. 5, translation was kept fixed at (U, V, W) = (-120, 100, 150), the 
rotation magnitude was increased by a multiplicative factor of 2.0 between successive 
experiments and 100 runs were made for each experiment. As before, the 
noise was Gaussian with mean 8% and standard deviation 2%. Note that a rotation 
of (α, β, γ) = (0.005, 0.004, 0.002) has a magnitude of 0.3845 degrees. When 
assuming continuous image motion (i.e. fine time sampling), rotations having 
magnitudes larger than one degree per frame are very large and thus unrealistic. 
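
The magnitudes quoted for the simulated motion can be reproduced with a few lines of Python. This is only an arithmetic check; it assumes the rotation magnitude is the Euclidean norm of the rotation vector converted from radians to degrees.

```python
import math

rotation = (0.005, 0.004, 0.002)        # radians per frame
translation = (-120.0, 100.0, 150.0)    # pixels per frame

# Euclidean norms of the two motion vectors.
rotation_deg = math.degrees(math.sqrt(sum(c * c for c in rotation)))
translation_mag = math.sqrt(sum(c * c for c in translation))
```

The results agree with the 0.3845 degrees and 216.565 pixels per frame given in the text to within rounding in the last digit.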




Fig. 5. (a) Mean FOE error versus magnitude of rotation and (b) Standard deviation 
of FOE error versus magnitude of rotation. Note that the scale on the horizontal axes 
is logarithmic with base 2.0. 



4.2 Real Image Sequences 

The method has also been tested using flow fields computed from real imagery for 
which the ground truth was known a priori. Throughout all experiments, optical 
flow was computed using an implementation of the Lucas & Kanade algorithm 
m- The first experiment employed the “yosemite” image sequence, one frame 
of which is shown in Fig. 6(a). This sequence contains both translation and rotation 
and depicts a flight through Yosemite valley. Since the clouds are moving 
independently, only the optical flow vectors computed at the lower portion of 
the images have been employed. This portion of the original images corresponds 
to a field of view equal to 49.6 degrees horizontally and 29 degrees vertically. 
The true FOE is rather close to the center of the field of view, namely at (0, 
58)^ while the estimate computed by the proposed method was (-17.3, 72.3), a 
value that corresponds to an error of 22.4 pixels or 3.7 degrees. This amount of 
error compares favorably to errors in the “yosemite” FOE estimates appearing 
in the literature. More specifically, Heeger and Jepson Pj report an error of 3.5 
degrees for the “yosemite” sequence and Daniilidis reports an error of 4.0 
degrees. The rotation recovered by the proposed method using robust regression 
on projections of flow vectors that are perpendicular to lines through the reco- 
vered FOE, was equal to (0.000906,0.002116,0.000481) (in radians/frame). As 
mentioned in |C], the actual rotational velocity for the “yosemite” sequence is 
(0.00023,0.00162,0.00028). 

^ These are “calibrated” image coordinates, defined with respect to the image principal 
point. 

Fig. 6. (a)-(d) the “yosemite” image sequence, (b)-(e) the “marbled block” image sequence 
and (c)-(f) the “nasa” image sequence. One frame from each sequence is shown 
in the top row, while the optical flow fields used for egomotion estimation are shown in 
the bottom row. 



The second experiment refers to the “marbled block” sequence, one frame of 
which is shown in Fig. 6(b). The sequence was captured by a translating camera 
mounted on a robot arm that was moving above a textured floor in a right to 
left direction and contains many sharp discontinuities in depth and motion. The 
four dark blocks that lie on the floor are stationary, while the white block in the 
middle of the scene is moving independently with a right to left direction. The 
images of the “marbled block” sequence subtend 25.6 degrees of visual angle. 
The primary difficulty when estimating the egomotion for this sequence stems 
from the fact that the true FOE is outside the field of view, specifically at (777, 
95.6). Thus, the angle between the direction of translation and the optical axis is 
about 35 degrees. The proposed method estimated the FOE at (625.0, 111.4), in 
error by 152.7 pixels or 5.65 degrees. For comparison, the FOE estimate reported 
by Daniilidis in Pj amounts to an error of 7.17 degrees. The rotation estimated 
by the proposed method was equal to (-0.000748, 0.000291, 0.000031), close to 
being zero as expected. 

The last experiment is based on the “nasa” image sequence, shown in Fig. 6(c). 
Since the camera undergoes a purely translational motion, a rotation 
of (α, β, γ) = (-0.00025, -0.0018, 0.00030) was added synthetically in order to 
make the experiment more challenging. The ground truth for the FOE is (-5, 
-8) while the recovered FOE was (2.21, 49.29), in error by 57.74 pixels or 5.5 
degrees. For reference, the images of the “nasa” sequence subtend 24 degrees 
of visual angle. The rotation estimated by the proposed method was equal to 
(-0.000176,-0.001918,0.000138). The rather large error in the recovered FOE 
for the “nasa” sequence is due to the proximity between the true FOE and the 
image principal point. Therefore, in this case, the distance $D_2$ (see Eq. ®) of the 
FOE from every line through the principal point is very small and thus difficult 
to estimate accurately. 



5 Conclusions 

Accurate estimation of camera motion is important for many vision based tasks. 
In this paper, a novel constraint regarding the parameters of 3D motion has been 
presented. This constraint was used to develop a method for egomotion estimation 
that has several advantages. First, the method does not impose any constraints 
on the egomotion that can be recovered or on the structure of the viewed scene. 
Second, egomotion is computed through closed form solutions of linear equati- 
ons, avoiding searching the space of possible solutions. The use of such linear 
constraints permits the exploitation of overdetermined linear systems through 
the application of robust linear regression techniques. The egomotion estimate 
computed by the proposed method can either be used as is, or, optionally, for 
bootstrapping more elaborate, iterative nonlinear egomotion estimation methods 
for refining it. Third, instead of employing local information derived from small 
image regions, redundancy is exploited by combining information across the whole 
visual field. Fourth, the method does not assume the availability of a dense optical 
flow field. This is very important for practical applications, since image sequences 
often have uniform, textureless areas that give rise to sparse optical flow fields. 
Finally, the use of a robust estimator such as LMedS safeguards against errors in 
the input, which could otherwise have a significant effect on the accuracy of the 
computations. Experimental results collected from extensive simulations as well 
as real image sequences indicate the effectiveness and robustness of the proposed 
method. 


Abstract. We present a method for fully automatic 3D reconstruction 
from a pair of uncalibrated images in order to deal with the modeling of 
complex rigid scenes. A 2D triangular mesh model of the scene is calcula- 
ted using a two-step algorithm mixing sparse matching and dense motion 
estimation approaches. The 2D mesh is iteratively refined to fit any ar- 
bitrary 3D surface. At convergence, each triangular patch corresponds 
to the projection of a 3D plane. The algorithm proposed here relies first 
on a dense disparity field. The dense field estimation, modeled within a 
robust framework, is constrained by the epipolar geometry. The resulting 
field is then segmented according to homographic models using iterative 
Delaunay triangulation. In association with a simplified self-calibration 
algorithm, this 2D planar model is used to obtain a VRML-compatible 
3D model of the scene. 



Many recent works attempt to deal with 3D reconstruction from sets of images. 
Two different classes of approaches are generally proposed using different types 
of information: the first one includes model-based methods and the second one 
deals with model-free methods. 

In model-based approaches, the scene information is assumed to be composed 
of large polygonal objects described by a limited set of 3D points characterizing 
the vertices of each 3D plane. This model can be computed in the 2D space 
without 3D information. This can be done by extracting, matching and 3D re- 
constructing points of interest 0 or edges 0 . One of the main limitations of these 
methods is that effective planarity of generated facets is assumed but not always 
satisfied. To enforce global planarity, manual intervention is usually 
necessary to indicate reliable coplanar points. Another way of estimating the 
model is to use disparity maps (or alternatively depth maps). In [HI, Koch et al. 
suggest computing differential properties from a dense disparity map. Images are 
then segmented according to similar surface orientation at each point of a region. 
The underlying strongly polyhedral assumption is indeed the major limitation 
of model-based techniques. 

To enlarge the variety of treated scenes, model-free representations (second 
class of approaches) have been proposed. Such methods generally rely on a dense 
disparity map. This map can be combined with weak or strong calibration information 
to provide a depth map that can be manipulated for view synthesis [tiliSj. 
The major limitation here lies in estimating reliable dense disparity 
information that copes efficiently with occlusion areas and spatial discontinuities. 

The main objective of our study is to propose an entirely automatic approach 
for the reconstruction of textured scenes that are not necessarily polyhedral. In 
addition to this general-purpose goal, we require easy, real-time visualization. 
This latter requirement practically rules out methods based entirely on a dense 
depth map. On the other hand, the removal of 
the polyhedral scene assumption favors such approaches. Following these two 
remarks and in order to comply with the previously described goals, our aim is 
to suggest a compromise between model-free and model-based methods. We first 
propose to describe the 3D scene by a triangular mesh which can be displayed 
by most visualization dedicated systems. Our method therefore belongs to the 
first class (model-based approaches) but as this triangular mesh is automatically 
computed from a dense disparity field, it is also related to the second class. 

The key point of our method is to segment the images into regions which 
are actually planar in the 3D scene and to extract the planarity property from 
the image data (and not from user intervention). This is indeed equivalent 
to performing motion segmentation according to a homographic model. As the 
homographic model describing the set of admissible transformations of planar 
patches is non-linear, a direct region-based segmentation method is hardly 
feasible. We have therefore designed a two-step method. The first step provides a 
geometrically constrained dense depth map and an associated discontinuity map. 
This dense information is then used to initialize the second step: homographic 
model estimation and segmentation. 

The outline of the paper is as follows. The first section briefly describes 
geometric definitions associated with the perspective projection of two images. In this 
context, epipolar geometry is presented. This important geometric constraint is 
used in all the following steps of our method and must be estimated beforehand. 

In the second section, in order to facilitate a subsequent planar facet segmen- 
tation step, we present a geometrically constrained disparity field estimation. 
This technique is derived from a robust optical flow estimation approach. Unlike 
classical correlation methods, it provides a reliable piecewise smooth motion field 
EO . Moreover the disparity estimation is constrained by the associated epipo- 
lar geometry so that the estimated field is explicitly forced to be geometrically 
consistent with a perspective projection model and with the fixed scene assump- 
tion. This constraint also yields a substantial computational cost decrease (the 
2D disparity estimation problem is reduced to a 1D problem). 

The third section presents the planar facet segmentation step of our method. 
To ensure the effective planarity of each reconstructed triangle, an adaptive 
iterative triangulation based on homographic models estimation is computed 
from the disparity field. 

By arbitrarily fixing intrinsic parameters, 3D rotation and translation para- 
meters can be extracted from the epipolar geometry. Using this 3D information, 
the resulting 2D model is then re-projected in the 3D space to be visualized as 
a VRML representation. 

This method has been validated on synthetic and real world images. Comparisons 
with existing classical techniques are presented in the last section of the 
paper. 

Remark: in the following, vectors will be represented by bold letters. 

1 Epipolar Geometry 

1.1 Definition 

The characterization of the geometry associated with the two cameras is of key 
importance in order to build a 3D model of the scene. In our case, we deal 
with two uncalibrated cameras (or alternatively one moving camera shooting 
a rigid scene) assuming a pinhole camera model. This model characterizes the 
projection of a 3D point P(X, Y, Z) on a point p(x, y) of the image plane. In the 
case of two images, the projection model is defined by a system of two equations 
linking a 3D point P(X, Y, Z) to its projections $p_1(x, y)$ in the first image and 
$p_2(x', y')$ in the second one. Without loss of generality, we assume that the 
world coordinate system coincides with the first camera coordinate system. The 
resulting system can be written using homogeneous coordinates as follows (where 
˜ denotes homogeneous coordinates): 

$$\tilde{p}_1 = A_1 [I \;\; 0] \tilde{P}, \qquad \tilde{p}_2 = A_2 [R \;\; t] \tilde{P} \qquad (1)$$

R is the rotation matrix and t the translation vector between the first and 
the second camera location (extrinsic parameters). Matrices $A_1$ and $A_2$ contain the 
internal camera parameters (intrinsic parameters). 

Eliminating $\tilde{P}$ in equations (1) leads to a relation linking the projections of 
a 3D point in both images: 

$$\tilde{p}_2^T A_2^{-T} [t]_\times R A_1^{-1} \tilde{p}_1 = 0, \qquad (2)$$

where $[t]_\times$ denotes the cross product matrix associated with the translation 
vector. 

This constraint, called epipolar geometry, was first introduced by Longuet-Higgins HD. 
It is entirely defined by a 3 x 3 homogeneous matrix called the fundamental 
matrix, formulated as $F_{12} = A_2^{-T} [t]_\times R A_1^{-1}$. By construction, this 
matrix is of rank 2 and is defined up to a non-zero scalar factor. A fundamental 
matrix has therefore only seven degrees of freedom. 

The epipolar constraint can be used to determine the epipolar line $l_2$ in the 
second image associated to the point $p_1$. It represents the line of potential 
correspondences of $p_1$ in the second image. Line $l_2$ is given by: 

$$\tilde{l}_2 = F_{12} \tilde{p}_1 \qquad (3)$$

where $\tilde{l}_2$ denotes homogeneous coordinates of $l_2$, i.e. all points $p_2$ in $l_2$ satisfy 
$\tilde{l}_2^T \tilde{p}_2 = 0$. 
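
As an illustration of the epipolar constraint, the Python sketch below builds a fundamental matrix from intrinsic and extrinsic parameters, projects one 3D point into both views, and checks that the second projection lies on the epipolar line of the first. All numerical values (focal lengths, principal points, rotation angle, translation, 3D point) and helper names are hypothetical, and the intrinsic matrices are assumed to have the simple form [[f, 0, cx], [0, f, cy], [0, 0, 1]].

```python
import math

def matmul(a, b):
    """Product of two 3x3 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def transpose(a):
    return [[a[j][i] for j in range(3)] for i in range(3)]

def matvec(a, x):
    return [sum(a[i][k] * x[k] for k in range(3)) for i in range(3)]

def skew(t):
    """Cross product matrix: matvec(skew(t), x) equals the cross product t x x."""
    tx, ty, tz = t
    return [[0.0, -tz, ty], [tz, 0.0, -tx], [-ty, tx, 0.0]]

def intrinsics(f, cx, cy):
    return [[f, 0.0, cx], [0.0, f, cy], [0.0, 0.0, 1.0]]

def intrinsics_inverse(f, cx, cy):
    # Closed-form inverse of the simple intrinsic matrix above.
    return [[1.0 / f, 0.0, -cx / f], [0.0, 1.0 / f, -cy / f], [0.0, 0.0, 1.0]]

def project(A, R, t, P):
    """Project 3D point P with camera A [R t]; third coordinate set to 1."""
    cam = [sum(R[i][k] * P[k] for k in range(3)) + t[i] for i in range(3)]
    p = matvec(A, cam)
    return [p[0] / p[2], p[1] / p[2], 1.0]

def det3(m):
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

# Hypothetical two-camera setup.
f1, c1 = 800.0, (320.0, 240.0)
f2, c2 = 820.0, (310.0, 245.0)
angle = math.radians(5.0)
R = [[math.cos(angle), 0.0, math.sin(angle)],
     [0.0, 1.0, 0.0],
     [-math.sin(angle), 0.0, math.cos(angle)]]
t = [-0.5, 0.1, 0.05]
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

# Fundamental matrix: inverse-transpose of A2, times [t]x, times R, times inverse of A1.
F12 = matmul(matmul(transpose(intrinsics_inverse(f2, *c2)), skew(t)),
             matmul(R, intrinsics_inverse(f1, *c1)))

P = [0.3, -0.2, 4.0]
p1 = project(intrinsics(f1, *c1), identity, [0.0, 0.0, 0.0], P)
p2 = project(intrinsics(f2, *c2), R, t, P)

l2 = matvec(F12, p1)                              # epipolar line of p1
residual = sum(l2[i] * p2[i] for i in range(3))   # should vanish
determinant = det3(F12)                           # rank 2 implies zero
```

Both `residual` and `determinant` vanish up to rounding, which reflects the epipolar constraint and the rank 2 property respectively.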

1.2 Case Specific Application 

The issue we are concerned with is the recovery of 3D information from sets of 2D 
images. It consists in solving system (1) to obtain the 3D point P. To that 
end, corresponding points $p_1$ and $p_2$ and calibration parameters (extrinsic and 
intrinsic parameters) giving $A_1$, $A_2$, R, t have to be recovered. The first issue 
can be greatly simplified by constraining the matching process with the epipolar 
geometry while the second one can be achieved using a decomposition of the 
fundamental matrix (see section^). The epipolar geometry estimation is indeed 
a key point of our method. The next paragraph will present the method 
we use to recover the fundamental matrix from two uncalibrated images. 

1.3 Fundamental Matrix Estimation 

We assume here that corresponding points have been extracted and matched 
using a Harris and Stephens detector associated with a cross correlation process. 
This first step is equivalent to the one developed by Zhang EH. 

To take into account the nullity of the fundamental matrix determinant, we 
followed a method proposed by Boufama et al. based on the virtual parallax Pj. 
This method may be briefly described as follows. The fundamental matrix is first 
estimated from 8 matches: three of them are selected to perform a projective 
change of basis to constrain the matrix to be of rank 2. A fourth arbitrary pair 
is also added to complete the projective change of basis 0. The four last pairs 
are then used to provide a unique fundamental matrix solution which respects 
the rank 2 constraint (determinant of null value). 

In association with the determinant nullity constraint, the change of basis 
provides a normalization effect on point coordinates: the coordinates of the three 
points selected to characterize the new basis are assigned values between 0 
and 1. This implies that the coordinates of points belonging to the triangle defined 
by these points also belong to the range of 0 to 1. In order to perform an optimal 
normalization, the pairs of points are chosen as near as possible to image corners. 

Besides, to cope with erroneous matches, a robust estimation based on least 
median squares estimation is incorporated m 

2 Dense Disparity Field Estimation 

2.1 Constrained Optical Flow Expression 

Let $I_i(s)$ be the intensity in the ith image, where $s = (x, y) \in S$ denotes the spatial 
position on grid S. Assuming a constant intensity along motion trajectories, the 
brightness constancy assumption is expressed as: 

$$DFD(s, d_s) = I_1(s) - I_2(s + d_s) = 0, \qquad (4)$$

where DFD stands for the Displaced Frame Difference function and $\{d_s = (d_x, d_y), s \in S\}$ 
for the image displacements from position 1 to position 2. In 
the general case, this is a 2D problem: for each pixel, $d_x$ and $d_y$ have to be 
recovered. 

Using the epipolar constraint, it is possible to decompose the displacement 
vector $d_s$ into normal and tangential components with respect to the epipolar 
line (see Figure 1). The brightness constancy assumption is therefore rewritten 
as: $DFD(s, d_s) = I_1(s) - I_2(s + n_s + \lambda_s v_s) = 0$. The normal component $n_s$ and 
the unit vector on the epipolar line $v_s$ can be computed from the fundamental 
matrix for any position s (see eq. 3). The enforcement of the epipolar constraint 
at every point reduces the original 2D estimation problem to a 1D problem: the 
estimation of $\lambda = \{\lambda_s, s \in S\}$ along epipolar lines. 
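
The decomposition of a displacement into a normal part and a tangential part along the epipolar line can be sketched in Python as follows. The sketch assumes that the epipolar line of position s is available as homogeneous coefficients (a, b, c), that the normal component is the vector from s to its orthogonal projection on that line, and that the unit direction is chosen as (-b, a) normalized; the sign convention for the direction and the function names are illustrative choices.

```python
import math

def decompose_on_epipolar_line(s, line):
    """Return (n, v): normal offset from s to the line and unit line direction."""
    x, y = s
    a, b, c = line
    norm = math.hypot(a, b)
    signed_dist = (a * x + b * y + c) / norm
    # Moving against the line normal by the signed distance lands on the line.
    n = (-signed_dist * a / norm, -signed_dist * b / norm)
    v = (-b / norm, a / norm)
    return n, v

def tangential_component(d, n, v):
    """Scalar position along the line of a displacement d ending on the line."""
    return (d[0] - n[0]) * v[0] + (d[1] - n[1]) * v[1]

# Usage with hypothetical values: the line 3x + 4y - 150 = 0 and s = (10, 20).
s = (10.0, 20.0)
line = (3.0, 4.0, -150.0)
n, v = decompose_on_epipolar_line(s, line)
d = (n[0] + 5.0 * v[0], n[1] + 5.0 * v[1])   # displacement built with scalar 5
lam = tangential_component(d, n, v)
```

Once n and v are known for a pixel, only the single scalar along the line remains to be estimated, which is the reduction from a 2D to a 1D problem described above.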




Fig. 1. Displacement vector decomposition. 

The DFD expression is highly non-linear with respect to the displacements. 
To avoid a difficult non-linear estimation, a Taylor expansion of this equation is 
considered around point $s + n_s$. This linearization leads to a constrained optical 
flow equation: 

$$I_1(s) - I_2(s + n_s) - \lambda_s v_s \cdot \nabla I_2(s + n_s) = 0,$$

where $\nabla$ is the spatial gradient. 

This equation relies nevertheless on an inherent ambiguity. The fundamental 
matrix defining epipolar lines is well known to be far more reliably estimated 
for large displacements between the two camera viewpoints. This large displacement 
assumption somewhat contradicts the infinitesimal disparity hypothesis. 
To overcome this incompatibility, the estimation is embedded in a coarse-to-fine 
multiresolution scheme. 

2.2 Multiresolution Scheme 

At a given level k, the disparity $\lambda^k = \{\lambda_s^k, s \in S\}$ is decomposed into a 
previously estimated disparity $\tilde{\lambda}^k$ (coming from a coarser level $k - 1$ or a previous 
iteration) and a refinement $d\lambda^k = \{d\lambda_s^k, s \in S\}$ to be estimated. Considering the 
brightness constancy assumption for the total displacement yields the following 
equation: 

$$I_1^k(s) - I_2^k\left(s + n_s^k + [\tilde{\lambda}_s^k + d\lambda_s^k] v_s^k\right) = 0, \qquad (5)$$

to be solved with respect to $d\lambda_s^k$. 

This equation involves pyramids of images $I_i^k = \{I_i^k(s), s \in S\}$, $k = 0, \ldots, K$, 
$i = 1, 2$, and pyramids of tangent and normal vectors $n^k = \{n_s^k, s \in S\}$, 
$v^k = \{v_s^k, s \in S\}$, with k spanning from K (the coarsest resolution) to 0 (the 
finest resolution). The image pyramids are derived from the original images 
by successive Gaussian smoothing and regular subsampling by a factor of two 
in each direction. 

As for pyramids $n^k$ and $v^k$, we consider fundamental matrices $\{F^k\}$, 
$k = 0, \ldots, K$, deduced for each level from the initial matrix F and a change of 
coordinates. More precisely, we have: 

$$F^k = (H^k)^T F H^k,$$

where $H^k = diag(2^k, 2^k, 1)$ is the matrix associated with the considered change 
of basis involved in the pyramidal representation. The matrix $F^k$ allows 
$n_s^k$ and $v_s^k$ to be computed at resolution k for each position s. 
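
The change of coordinates between pyramid levels can be sketched in Python. The sketch assumes that coordinates at level k are the full-resolution coordinates divided by 2^k, in which case multiplying the fundamental matrix on both sides by diag(2^k, 2^k, 1) preserves the epipolar constraint. The function names, the example matrix, and the point pair are hypothetical.

```python
def scale_fundamental(F, k):
    """Fundamental matrix at pyramid level k: diag(h) * F * diag(h)."""
    scale = 2.0 ** k
    h = (scale, scale, 1.0)
    return [[h[i] * F[i][j] * h[j] for j in range(3)] for i in range(3)]

def to_level(p, k):
    """Full-resolution homogeneous point expressed at pyramid level k."""
    scale = 2.0 ** k
    return (p[0] / scale, p[1] / scale, 1.0)

def epipolar_residual(F, p1, p2):
    """Value of p2^T F p1, zero for a geometrically consistent pair."""
    return sum(p2[i] * F[i][j] * p1[j] for i in range(3) for j in range(3))

# Hypothetical full-resolution matrix (cross product matrix of (2, 1, 0)) and a
# point pair that satisfies its epipolar constraint.
F = [[0.0, 0.0, 1.0],
     [0.0, 0.0, -2.0],
     [-1.0, 2.0, 0.0]]
p1 = (100.0, 40.0, 1.0)
p2 = (130.0, 55.0, 1.0)

level = 2
F_k = scale_fundamental(F, level)
residual_full = epipolar_residual(F, p1, p2)
residual_level = epipolar_residual(F_k, to_level(p1, level), to_level(p2, level))
```

Both residuals are zero, so the pair remains consistent with the scaled matrix at the coarser level.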

To ensure that the disparity previously estimated at level $k - 1$ follows the 
current epipolar geometry at a given level k, $\tilde{\lambda}^k$ is deduced by projecting the disparity 
$d^{k-1}$ (with a multiplying factor of 2) onto the epipolar lines at level k (see Fig. 2). 





2.3 Global Estimation Method 

For the sake of clarity, we will omit the resolution superscript k in all expressions 
throughout the remainder of this paper. All the expressions will be meant to 
concern level k. Following the same principle as previously, the DFD expression 
is linearized around point $s + n_s + \lambda_s v_s$. This leads to a displaced version of 
the constrained optical flow equation: 

$$d\lambda_s v_s \cdot \nabla \tilde{I}_2(s) + \tilde{I}_2(s) - I_1(s) = 0.$$

Assuming this equation is almost satisfied everywhere and that the disparity 
field is piecewise smooth, the disparity estimation problem may be addressed by 
the following minimization problem: 

$$\widehat{d\lambda} = \arg\min_{d\lambda \in \mathbb{R}^{|S|}} H(d\lambda) = \arg\min_{d\lambda \in \mathbb{R}^{|S|}} \left( H_1(d\lambda, I) + \alpha H_2(d\lambda) \right), \qquad (6)$$

where α is an arbitrary fixed constant. The first term $H_1$ of the objective function 
H represents the data model: 



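$$H_1(d\lambda, I) = \sum_{s \in S} \rho_1\left( d\lambda_s v_s \cdot \nabla \tilde{I}_2(s) + \tilde{I}_2(s) - I_1(s) \right), \qquad (7)$$

where $\tilde{I}_2 = \{I_2(s + n_s + \lambda_s v_s), s \in S\}$ is the backward registered version of the 
second image. The second term $H_2$ is the smoothness prior, which favors piecewise 
smooth disparity solutions. This term is expressed over all pairs $\langle s,r \rangle \in C$ 
of mutual neighbors (according to a 4-neighborhood system in our implementation): 

$$H_2(d\lambda) = \sum_{\langle s,r \rangle \in C} \rho_2\left( \|d_s - d_r\| \right).$$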

To cope with large deviations from the data model (resp. to allow disparity 
depth discontinuities), $H_1$ (resp. $H_2$) includes an M-estimator $\rho_1$ (resp. $\rho_2$). Under some 
simple conditions (|3E]) (mainly the concavity of $\phi(u) = \rho(\sqrt{u})$), any multidimensional 
minimization of the form “find $\arg\min_x \sum_i \rho(g_i(x))$” is equivalent 
to an optimization problem of the form “find $\arg\min_{x,z} \sum_i z_i g_i(x)^2 + \psi(z_i)$” 
involving auxiliary variables (or weights) $z_i$ continuously lying in (0,1]. The 
function ψ (which is never used in practice) is a decreasing function depending 
on ρ. In our case, the weights are of two kinds: (a) the data outlier weights 
$\delta = \{\delta_s, s \in S\}$ (provided by the semi-quadratic formulation of $H_1$), and (b) 
the discontinuity weights $\beta = \{\beta_{sr}, \langle s,r \rangle \in C\}$ related to the semi-quadratic 
formulation of $H_2$ and lying on the dual edge grid. The estimation problem is 
now expressed as a global minimization in $(d\lambda, \beta, \delta)$ of $H = H_1 + \alpha H_2$ where: 

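$$H_1 = \sum_{s \in S} \tau_1 \delta_s \left[ d\lambda_s v_s \cdot \nabla \tilde{I}_2(s) + \tilde{I}_2(s) - I_1(s) \right]^2 + \psi_1(\delta_s)$$

$$H_2 = \sum_{\langle s,r \rangle \in C} \tau_2 \beta_{sr} \left\| d\lambda_s v_s + \lambda_s v_s + n_s - d_r \right\|^2 + \psi_2(\beta_{sr})$$

where $d_r = (\lambda_r + d\lambda_r) v_r + n_r$. The scalar $\tau_i$ is a parameter depending on the 
M-estimator chosen. 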

The energy contribution of a point s to $H_1$ is thus weighted by a factor 
$\delta_s \in (0,1]$: the larger the contribution, the smaller the weight. Similarly, each 
pair of neighbors $\langle s,r \rangle \in C$ contributes to $H_2$ with a weight $\beta_{sr} \in (0,1]$ 
depending on their displacement vector difference $\|d_s - d_r\|$. The larger the 
difference, the smaller the weight. 

The resulting semi-quadratic minimization problem is conducted alternati- 
vely with respect to the different variables (here the scalar field dX and the two 
weight fields <5 and /3). The minimization with respect to weights are given in 
the following closed from I3I5I : 

$$\arg\min_{z_i}\ z_i g_i(x)^2 + \psi(z_i) = \frac{\rho'(g_i(x))}{2 g_i(x)}.$$
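As an illustration of this closed form, the following sketch evaluates the weight for one particular M-estimator. The choice of the Geman-McClure-type function $\rho(u) = \sigma^2 u^2 / (\sigma^2 + u^2)$ and the names `rho_prime`, `weight`, and `sigma` are assumptions made for this example only; the text does not say which estimator is used.

```python
def rho_prime(u, sigma):
    """Derivative of the assumed estimator rho(u) = sigma^2 u^2 / (sigma^2 + u^2)."""
    return 2.0 * u * sigma**4 / (sigma**2 + u**2) ** 2


def weight(u, sigma):
    """Closed-form auxiliary variable z = rho'(u) / (2 u).

    The value at u = 0 is taken as the limit of the ratio, which is 1
    for this (assumed) estimator.
    """
    if u == 0.0:
        return 1.0
    return rho_prime(u, sigma) / (2.0 * u)
```

For this estimator the weight equals $\sigma^4 / (\sigma^2 + u^2)^2$, so it lies in $(0,1]$ and decreases as the residual grows, which matches the behaviour described for $\delta_s$ and $\beta_{sr}$.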






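Now considering the weights as frozen, the minimization with respect to
$d\lambda_s$ is a classical weighted quadratic problem solved using an iterative method.
Using a Gauss-Seidel scheme, the local update $d\lambda_s^{(n)}$ at iteration $n$ of the iterative
solver is given by:

$$d\lambda_s^{(n)} = \frac{\alpha\left(-\lambda_s \beta_s + \beta_s\,(\bar{d}_s^{(n-1)} - n_s)^T v_s\right) - \delta_s\,\left(v_s \cdot \nabla\tilde{I}_2(s)\right) I_t(s)}{\delta_s \left(v_s \cdot \nabla\tilde{I}_2(s)\right)^2 + \alpha \beta_s}$$

where $I_t(s) = \tilde{I}_2(s) - I_1(s)$, $\bar{d}_s^{(n-1)}$ is the weighted average of neighboring disparity
vectors at iteration $n - 1$, and $\beta_s$ is the sum of the spatial discontinuity variables
between $s$ and its neighbors.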
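With the weights frozen, each unknown has a closed-form update given its neighbors, which is what a Gauss-Seidel sweep exploits. The sketch below shows this on a deliberately simplified 1D analogue (a scalar signal with data term $\delta_i (x_i - y_i)^2$ and smoothness term $\alpha \beta_i (x_i - x_{i+1})^2$), not on the disparity energy itself. The function name and the toy energy are assumptions for illustration.

```python
def gauss_seidel_1d(y, delta, beta, alpha, n_iter=200):
    """Minimise sum_i delta[i]*(x[i]-y[i])**2 + alpha*sum_i beta[i]*(x[i]-x[i+1])**2.

    delta[i] plays the role of a data weight and beta[i] that of the weight
    on the edge (i, i+1); both are held fixed during the sweeps.
    """
    x = list(y)  # initialise with the data
    n = len(x)
    for _ in range(n_iter):
        for i in range(n):
            num = delta[i] * y[i]
            den = delta[i]
            if i > 0:  # left neighbour, already updated in this sweep
                num += alpha * beta[i - 1] * x[i - 1]
                den += alpha * beta[i - 1]
            if i < n - 1:  # right neighbour, value from the previous sweep
                num += alpha * beta[i] * x[i + 1]
                den += alpha * beta[i]
            x[i] = num / den
    return x
```

Setting an edge weight to zero decouples the two sides of that edge, so a step in the data is preserved rather than smoothed, which is the role played by the discontinuity weights.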

Let us note that in case of long range disparity, an initial disparity field is 
necessary to avoid solutions corresponding to undesirable local minima. In our 
case, we consider an initialization derived from the interpolation of the initial 
matched points of interest used for the computation of the fundamental matrix. 
We have used here a bilinear interpolation based on a Delaunay triangulation. 
The resulting field is projected on the top level of the pyramid to provide an
initial disparity field for the coarsest resolution level with respect to the associated
epipolar geometry (projection on the associated epipolar lines, see Fig. 2).

3 Segmentation 

As our final goal is to provide an easy-to-handle 3D reconstruction of the scene,
we now introduce a segmentation method for the dense disparity field obtained
at the previous step. The method we propose is based on an adaptive triangular
mesh structure. The idea is to recursively split an initial mesh until each
triangular element corresponds to a 3D planar element. The associated splitting
criterion is based on the homographic parametric model of the disparity field.
It can easily be shown that, under a pinhole camera model, the disparity
associated with a planar surface projected as $\Pi_1$ in the first image and $\Pi_2$
in the second image satisfies a homographic model. This model is linear in
homogeneous coordinates. For the sake of clarity, all the following expressions
are written in homogeneous coordinates. The homographic model links two
corresponding points $s$ and $s + d_s$ of $\Pi_1$ and $\Pi_2$ through a 3 x 3 homography
matrix $H$, defined up to a scalar factor $\mu$:

$$\forall s \in \Pi_1,\quad Hs = \mu(s + d_s).$$

The segmentation step we propose thus consists in triangulating the disparity
map until the disparity vectors associated with each patch correspond to
a single representative homographic model. An initial Delaunay triangulation is
first built from four arbitrary points near the corners of the image.
This triangulation is then refined until each triangle satisfies a distance criterion
between the dense disparity estimate and a homographic model estimated
within the considered triangle.

3.1 Homography Estimation

The homography estimation is performed using a method proposed by Robert
and Faugeras. The method relies on the epipolar geometry to efficiently
estimate the homography matrix from three or more corresponding pairs of
points.

For $H$ to be consistent with the epipolar geometry, the homogeneous symmetric
matrix $F^T H + H^T F$ must be null. This leads to 6 homogeneous equations in the
unknowns $h_{ij}$ (the coefficients of $H$). In our case, each point of a considered
triangle $T$ accounts for one scalar equation; we therefore have the following
system of equations:

$$\forall s \in T,\quad [s + d_s,\ Fs,\ Hs] = 0,$$

where $[a,b,c]$ denotes the triple product.

This over-constrained system can be rewritten in matrix notation as $Ah = 0$,
where $h$ is an 8-component vector gathering the unknown coefficients of $H$ and
$A$ is a $(|\{s \in T\}| + 6) \times 8$ matrix. An estimate of $h$ is computed using an SVD
(singular value decomposition) of the matrix $A^T A$.

To be robust to problematic situations where the estimated disparities are
likely to be biased or erroneous (such as occlusion areas or range discontinuities),
we exclude from this system the points which are not simultaneously in accordance
with the data model and the smoothing model (points for which the data outlier
weights and the discontinuity weights approach zero).



3.2 Splitting Criterion 

The distance criterion we chose to drive the splitting of the triangular mesh is
decomposed into two terms:

- The first one measures the adequacy of $H$ to the disparity field. The influence
of each point $s$ of the triangle is weighted by the data model weight $\delta_s$ coming
from the robust estimator associated with the data model of the dense disparity
estimator (so that occlusion areas do not influence the distance measurement).
The resulting adequacy term is given by:

$$C_1(T, H, d) = \frac{1}{|T|} \sum_{s \in T} \delta_s \left[ \|Hs - (s + d_s)\|^2 + \|H^{-1}(s + d_s) - s\|^2 \right], \qquad (9)$$

where $\|\cdot\|$ denotes the Euclidean distance.

- The second term is related to the presence of disparity discontinuities within
the considered triangle. This term is defined as the mean of the discontinuity
weights included in the considered triangle. It is expressed as follows:

$$C_2(T, \beta) = \frac{1}{|C_T|} \sum_{\langle s,r\rangle \in C_T} \beta_{sr}, \qquad C_T = \{\langle s,r\rangle,\ s \in T,\ r \in T\},\ C_T \subset C, \qquad (10)$$

where $\langle s,r\rangle$ denotes a pair of neighboring pixels of image 1.

More precisely, a given triangle $T$ is split if the global criterion $C_1(T, H, d) +
\gamma C_2(T, \beta)$ exceeds a given threshold $\epsilon$. The parameter $\gamma$ is an arbitrary fixed
positive constant.
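The split test can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: pixels are 2D tuples, `H` is a 3x3 nested list, the adequacy term is averaged over the pixels of the triangle (the normalisation is an assumption), and the per-edge discontinuity terms are passed in as a plain list `edge_terms`. All function names are hypothetical.

```python
def apply_h(H, p):
    """Apply a 3x3 homography to a 2D point and dehomogenise."""
    x, y = p
    u = [H[r][0] * x + H[r][1] * y + H[r][2] for r in range(3)]
    return (u[0] / u[2], u[1] / u[2])


def inverse_3x3(M):
    """Inverse of a 3x3 matrix via the adjugate."""
    (a, b, c), (d, e, f), (g, h, i) = M
    det = a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)
    adj = [
        [e * i - f * h, c * h - b * i, b * f - c * e],
        [f * g - d * i, a * i - c * g, c * d - a * f],
        [d * h - e * g, b * g - a * h, a * e - b * d],
    ]
    return [[v / det for v in row] for row in adj]


def sq_dist(p, q):
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2


def c1(points, disparities, delta, H):
    """Weighted symmetric transfer error of H, averaged over the pixels."""
    H_inv = inverse_3x3(H)
    total = 0.0
    for s, d, w in zip(points, disparities, delta):
        target = (s[0] + d[0], s[1] + d[1])
        total += w * (sq_dist(apply_h(H, s), target)
                      + sq_dist(apply_h(H_inv, target), s))
    return total / len(points)


def c2(edge_terms):
    """Mean of the per-edge discontinuity terms inside the triangle."""
    return sum(edge_terms) / len(edge_terms)


def should_split(points, disparities, delta, H, edge_terms, gamma, eps):
    """True if C1 + gamma * C2 exceeds the threshold eps."""
    return c1(points, disparities, delta, H) + gamma * c2(edge_terms) > eps
```

Because each pixel's contribution is multiplied by its data weight, pixels flagged as outliers (weights near zero) barely affect the decision.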

3.3 Triangulation Refinement 

A given triangle is refined by adding a point located at its "center of mass",
the mass of each point being given by the value of its associated data outlier
weight. The iterative triangulation refinement is performed until the distance
measure $C(T)$ computed for each triangle falls below $\epsilon$.
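The insertion point can be computed as a weighted centroid. The sketch below assumes the pixels of the triangle are given as a list of (x, y) tuples with one data outlier weight each; the function name is hypothetical.

```python
def weighted_centroid(pixels, weights):
    """Centre of mass of the pixels, each with mass equal to its data weight."""
    total = sum(weights)
    if total == 0.0:
        raise ValueError("all weights are zero; centre of mass undefined")
    cx = sum(w * x for (x, _), w in zip(pixels, weights)) / total
    cy = sum(w * y for (_, y), w in zip(pixels, weights)) / total
    return (cx, cy)
```

Pixels with a weight close to zero (likely outliers) pull the new vertex very little, so the inserted point tends to fall in the part of the triangle where the disparity estimate is reliable.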

4 3D Reconstruction 

So far we have obtained a 2D triangulation of the first image and the associated
disparity information. The last step of our method consists in recovering 3D
information in order to build the final 3D model. To that end, the calibration
parameters of the cameras have to be estimated. As the aim is not an accurate
reconstruction but a visually satisfactory 3D representation, we have used a
simplified self-calibration technique. This approach consists in fixing the intrinsic
parameters (represented by the matrix $A$) to some arbitrary values and then
estimating the extrinsic parameters. The intrinsic parameters are chosen to
respect the following assumptions: the projection of the optical center is
at the center of the image, the image coordinate axes are perpendicular,
the horizontal and vertical pixel sizes are fixed and equal to one, and the focal
length is assigned a realistic value. The fundamental matrix $F$ gives access to
the essential matrix $E$. This matrix only depends on the extrinsic parameters,
namely the rotation matrix $R$ and the translation vector $t$ between the first
and the second camera locations:

$$E = A^T F A = [t]_\times R, \qquad (11)$$

where $[t]_\times$ is the antisymmetric cross-product matrix associated with the translation
vector $t$. As shown by Tsai and Huang, the essential matrix can be
decomposed in order to recover the rotation and translation parameters. Using a
singular value decomposition, $E$ can be written as follows:

$$E = \Delta \Sigma \Theta^T, \qquad (12)$$

where $\Delta$ and $\Theta$ are two orthogonal matrices and $\Sigma$ is a diagonal matrix. It can be
shown that an essential matrix has one null singular value, while the two others
have the same value (they can be set to 1 because of the homogeneous
property). Matrix $\Sigma$ can thus be rewritten as follows:

$$\Sigma = T_1 R_1 = \begin{pmatrix} 0 & 0 & 0 \\ 0 & 0 & 1 \\ 0 & -1 & 0 \end{pmatrix} \begin{pmatrix} 1 & 0 & 0 \\ 0 & 0 & -1 \\ 0 & 1 & 0 \end{pmatrix}. \qquad (13)$$

Injecting this decomposition into equation (12) allows the matrix $E$ to be written as
the product of an antisymmetric matrix and an orthogonal one. By identification
with equation (11), we can extract the rotation and translation matrices:

$$E = \underbrace{\Delta T_1 \Delta^T}_{[t]_\times}\ \underbrace{\Delta R_1 \Theta^T}_{R}.$$

We must note that this decomposition is not unique. The rotation is only
defined up to $\pi$ while the translation is defined up to a scalar factor. The adequate
pair of matrices is obtained by ensuring that a reconstructed 3D point lies in view
of the camera.

The resulting intrinsic and extrinsic parameters lead to two projection ma- 
trices. The nodes of the triangulation are then re-projected into the 3D space 
by solving the system Q according to the dense disparity field. The 3D mesh is 
coded in the VRML language to allow real time interactivity. To avoid texture 
projection artifacts due to affine mapping, a preliminary simple correction is pro- 
cessed (an homographic transformation is performed to set the texture collinear 
to the image plane). 

5 Results 

The proposed method has been applied to different kinds of image sequences. It
has been run both on real world sequences and on synthetic sequences for which a
ground truth exists.

The first sequence we consider here is the well known synthetic "Yosemite
sequence" (Fig. 3). In order to satisfy the rigidity assumption, a major part
of the sky containing moving clouds has been removed. Two different image pairs
of this sequence have been considered. In the first one, which is composed of two
consecutive images (images 9 and 10), the small range of the displacements (not
more than 4 pixels) makes the estimation of the epipolar geometry critical. The
second image pair, composed of far apart images in the sequence, constitutes
a difficult benchmark for the differential aspect of our method (up to 30
pixels of displacement).

As expected and shown on the recovered disparity map (Fig. 3(d)), the disparities
are larger in the mountain area in the foreground and decrease continuously
as we move towards the valley. The global aspect of this map is in accordance
with what could be expected from visual inspection.

In the following, we provide quantitative comparative results on this pair of
images. Angular deviations with respect to the actual flow field have been
computed. Table 1 lists the mean angular error and the associated standard
deviation. It gathers some results presented in [?] and by other authors (only
the highest and the lowest mean square error obtained by state-of-the-art methods
are presented, in comparison with the classical Horn and Schunck algorithm).
Note that we report here only the performances of similar algorithms (energy-based
dense estimators). Other results from more complex methods combining motion
estimation with a joint segmentation may be found in the literature. As may
be observed, our method yields a higher angular discrepancy than the two
state-of-the-art methods, although it remains satisfactory. A few remarks must be
made at this point. First, unlike the best methods mentioned in the table,
our method uses a simple iterative solver (Gauss-Seidel). It could therefore be
improved by using more efficient solvers. Second, it must be pointed out that
our method is a one-dimensional method. It is therefore much faster than the
others. Besides, due to the small motion, the epipolar geometry is quite difficult to
estimate accurately.

Fig. 3. Original images 3 (a), 11 (b) and 12 (c) of the "Yosemite" sequence; disparity
map for images 11 and 12 (d) and 3 and 12 (e) (the darker, the smaller the disparity
value) and final triangulation (f); reconstructed images for translation along the Z axis
(small (g), large (h)) and for a complex motion (viewpoint on the left of the
foreground mountain) (i)

Technique             Mean error   Standard deviation
Horn and Schunck      9.78°        16.19°
Black [2]             3.52°        3.25°
Lai and Vemuri        1.99°        1.45°
Our method            4.82°        3.27°

Table 1. Comparative results on Yosemite
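The text does not spell out how the angular deviation is computed. A common convention in the optical flow literature, assumed here, measures the angle between the space-time vectors (u, v, 1) built from the estimated and reference flow vectors. The sketch below follows that convention; the function names are hypothetical and the statistics use the population standard deviation.

```python
import math


def angular_error_deg(est, ref):
    """Angle in degrees between the (u, v, 1) vectors of two flow vectors."""
    (u1, v1), (u2, v2) = est, ref
    dot = u1 * u2 + v1 * v2 + 1.0
    norm = math.sqrt(u1 * u1 + v1 * v1 + 1.0) * math.sqrt(u2 * u2 + v2 * v2 + 1.0)
    cos_angle = max(-1.0, min(1.0, dot / norm))  # guard against rounding
    return math.degrees(math.acos(cos_angle))


def mean_and_std(errors):
    """Mean and population standard deviation of a list of errors."""
    n = len(errors)
    mean = sum(errors) / n
    var = sum((e - mean) ** 2 for e in errors) / n
    return mean, math.sqrt(var)
```

Applying `angular_error_deg` at every pixel and passing the results to `mean_and_std` gives the two kinds of figures reported in a table such as Table 1.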

Let us now consider the second sequence, composed of far apart images
(images 3 and 12) of the "Yosemite" sequence. Experiments on this sequence have
shown that, due to the presence of very large displacements (up to 30 pixels),
unconstrained optical flow estimators (even embedded in a
multiresolution framework) do not converge towards acceptable solutions. As
shown in the disparity map presented in Fig. 3(e), our method provides consistent
results. The foreground mountain is characterized by large disparity values,
whereas in the background disparities decrease smoothly. The dense disparity
field estimation performs well for an image presenting both small and large
displacements. The resulting field is globally smooth but presents discontinuities
at large depth changes.

The disparity field computed from images 3 and 12 has then been iteratively
triangulated to obtain a 2D model of the valley. The associated VRML
model has been computed by arbitrarily fixing the focal length to 1000. Figure 3
presents some interpolated and extrapolated images. The camera displacement
along the z-axis is not far from the real 3D motion in images 3(g) and 3(h).
The resulting images are visually satisfactory. Image 3(i) exemplifies more complex
displacements, illustrating occlusion problems.




Fig. 4. Two original images of an indoor sequence (a) and (b) 




Fig. 5. Three synthesized views from the same view point: model computed directly
from automatically extracted and matched points (a), (b) and (c); same 3D motion
simulation as previously with the model obtained by our method ((d) corresponds to (a), (e)
to (b) and (f) to (c))

Some reconstruction results obtained for a static scene shot by a moving
commercial camera (Fig. 4) are shown in Fig. 5. Two kinds of reconstruction
are presented here. The first one comes from the "image-matching" software
developed by Zhang [?], which gives a list of matching points of interest that
respect the epipolar geometry. These points are triangulated and re-projected to
obtain a 3D model. The examples presented in Fig. 5(a), 5(b) and 5(c) are constructed
from 89 automatically extracted matching points. The synthesized views reveal
the presence of outlier points that make the model visually uncomfortable.
This effect can mostly be explained by the presence of spurious matches that
respect both the epipolar geometry and the luminance consistency. The second
3D model results from our algorithm (Fig. 5(d), 5(e) and 5(f)). A visual inspection of
the reconstructed images shows far fewer artifacts for the same 3D displacements
of the virtual camera. Such results could now be used in the context of video
manipulation applications.

6 Conclusion 

In this paper we have presented a method for the reconstruction of complex
scenes from a pair of uncalibrated images. This method relies on the estimation
of a dense disparity field. The estimator proposed here is constrained by the
epipolar geometry and incorporates robust functions. We have experimentally
demonstrated that the recovered fields are of good quality even in unfavorable
cases (very close views). The final 3D reconstruction is obtained through a
segmentation process handled as a recursive adaptation of a triangular mesh. The
outlier information provided by the dense robust estimation is also used in
the segmentation step to improve the quality of the final reconstruction. The
efficiency of our approach has been validated on both polyhedral and
non-polyhedral complex scenes. The models obtained are good enough to be used
comfortably in the context of video manipulation applications. Nevertheless,
more accurate results could be expected using the best self-calibration
method available in the literature. A natural extension of our algorithm would
consist in considering the trifocal tensor (associated with three images [?])
instead of the fundamental matrix $F$, to avoid many degenerate estimation cases
of $F$. This could naturally lead to taking more than two images into account to
improve the VRML model quality.

Abstract. This paper addresses the problem of motion estimation and
reconstruction of 3D models from profiles of an object rotating on a turntable, obtained from
a single camera. Its main contribution is the development of a practical and accurate 
technique for solving this problem from profiles alone, which is, for the first time, 
precise enough to allow the reconstruction of the object. No correspondence between
points or lines is necessary, although the method proposed can be equally used when
these features are available, without any further adaptation. Symmetry properties of 
the surface of revolution swept out by the rotating object are exploited to obtain the 
image of the rotation axis and the homography relating epipolar lines, in a robust 
and elegant way. These, together with geometric constraints for images of rotating 
objects, are then used to obtain first the image of the horizon, which is the projec- 
tion of the plane that contains the camera centres, and then the epipoles, thus fully 
determining the epipolar geometry of the sequence of images. The estimation of the 
epipolar geometry by this sequential approach (image of rotation axis -> homography
-> image of the horizon -> epipoles) avoids many of the problems usually found
in other algorithms for motion recovery from profiles. In particular, the search for 
the epipoles, by far the most critical step, is carried out as a simple one-dimensional 
optimisation problem. The initialisation of the parameters is trivial and completely 
automatic for all stages of the algorithm. After the estimation of the epipolar geom- 
etry, the Euclidean motion is recovered using the fixed intrinsic parameters of the 
camera, obtained either from a calibration grid or from self-calibration techniques. 
Finally, the spinning object is reconstructed from its profiles, using the motion esti- 
mated in the previous stage. Results from real data are presented, demonstrating the 
efficiency and usefulness of the proposed methods. 



1 Introduction 

Methods for motion estimation and 3D reconstruction from point or line correspondences
in a sequence of images have achieved a high level of sophistication, with impressive results
[?]. Nevertheless, if corresponding points are not available, the current techniques cannot
be applied. That is exactly the case when the scene being viewed is composed of
non-textured smooth surfaces, and in this situation the predominant feature in the image is the
profile or apparent contour of the surface [?]. Besides, even when point correspondences
can be established, the profile still offers important clues for determining both motion and
shape, and therefore should be used whenever available.

This work presents a method for motion estimation and reconstruction of an object
rotating around a fixed axis from information provided by its profiles. It makes use of
symmetry properties of the surface of revolution swept out by the rotating object to overcome
the main difficulties and drawbacks present in other methods which have attempted to
estimate motion from apparent contours, namely: the need for a very good initialisation for
the epipolar geometry and an unrealistic demand for a large number of epipolar tangencies
[?] (here as few as two epipolar tangencies are needed), restriction to linear motion [?]
(whereas circular motion is a more practical situation), or the use of an affine approximation
[14,22] (which may be used only for shallow scenes). After obtaining the motion, the
reconstruction can be achieved by a simple technique, based on the epipolar parameterisation
[?], which extends the common triangulation methods from points to profiles.

The first attempts to approach the problem of motion estimation from apparent
contours date back to Rieger, in 1986 [?], who introduced the concept of frontier points,
interpreted as "centres of spin" [?] of the image motion. The paper dealt with the case of
fronto-parallel orthographic projection, which is a rather restrictive situation. This idea was
further developed by Porrill [?], who recognised the frontier point as a fixed point on the
surface, corresponding to the intersection of two consecutive contour generators [?]. The
connection between the epipolar geometry and the frontier points was established in [?],
and an algorithm for motion estimation from profiles was introduced in [?].

Related works also include [?], where a technique based on registering the images using
a planar curve was first developed. This method was implemented in [7], which also showed
results of reconstruction from the estimated motion. In [?] the algorithm presented in [?]
is specialised to the affine case.

The first steps towards a solution for the problem of reconstruction from apparent
contours with known camera motion were given by Barrow and Tenenbaum, in 1981 [?], where
a technique to compute surface normals was introduced. Koenderink [?] established
relations between the differential geometry of a surface and the differential geometry of its
profiles. This work was extended in [?], where algorithms for computing the curvature of a
surface from its profiles were developed and implemented for orthographic projection.

In [?] a reconstruction method based on parameterising the surface by radial curves
was developed. Better results can be achieved by using an epipolar parameterisation,
together with an interpolation using the osculating circle, as introduced in [?]. Further
refinements were obtained in [?], and a simple technique was developed in [?] based on
a finite-difference implementation of [?]. Despite its simplicity, the method developed in
[?] renders results comparable to those in [?] and [?], and was thus the technique chosen
to be used here.

An interesting comparison can be made between the work presented here and [?]. Both
papers tackle the same problem, but while in [?] hundreds of points are tracked and matched
for each pair of adjacent images, it is shown here that a solution can be obtained even when
only two epipolar tangencies are available, with at least comparable results.

Section 2 presents a summary of the theoretical background and notation used in the
remainder of the paper. It reviews the symmetry properties of images of surfaces of revolution
related to the harmonic homology, and presents two useful parameterisations of the
fundamental matrix. These parameterisations allow the estimation of the epipoles to be carried
out as independent one-dimensional searches, avoiding points of local minima. This greatly
reduces the computational complexity of the estimation. Section 3 presents the algorithm
for motion recovery, and the implementation of the algorithm for real data is shown in
Section 4, which also makes comparisons with previous works. The reconstruction technique
is described in Section 5, together with experimental results for reconstruction.

2 Theoretical Background

This section is a concise review of the mathematical background necessary for the rest of
the paper, and only the main results will be presented. Details of the derivations can be
found in [?].



2.1 Symmetry Properties of Images of Surfaces of Revolution 



A 2D homography that keeps the pencil of lines through a point $u$ and the set of points on
a line $l$ fixed is called a perspective collineation with centre $u$ and axis $l$. A homology is a
perspective collineation whose centre and axis are not incident (otherwise the perspective
collineation is called an elation). Let $x$ be a point mapped by a homology onto a point $x'$
and let $q$ be the line passing through these points. The point of intersection of $q$ and $l$ is
denoted by $v$. If $x$ and $x'$ are harmonic conjugates with respect to $u$ and $v$, i.e., their
cross-ratio is minus one, the homology is said to be a harmonic homology (see details in
[?, Chapter IX]). A curve or set of points invariant under a harmonic homology will
henceforth be called harmonically symmetric.

Consider an object rotating about a fixed axis. The surface of the object sweeps out a
surface of revolution $S$. The image of $S$ taken by a pinhole camera $P$ is a curve $s$. Let $l_s$ be
the image of the axis of rotation of the surface $S$ in the camera $P$. The optical centre of $P$
and the axis of rotation define a plane $\Pi_s$, whose normal direction is $n_x$. The image of the
point at infinity in the direction $n_x$ is the vanishing point $v_x$.

If $v_x$ and $l_s$ are represented in homogeneous coordinates, the 2D collineation $W$ given
by

$$W = I - 2\,\frac{v_x l_s^T}{v_x^T l_s} \qquad (1)$$

is a harmonic homology, and $s$ is harmonically symmetric with respect to $W$. It is worth
remembering that $W$ is an involutory matrix, i.e., $W^2 = I$. It can be shown that if the camera
$P$ points towards the axis of rotation, the harmonic homology $W$ reduces to a skew symmetry
transformation, and the curve $s$ will simply be skew symmetric about $l_s$. Furthermore,
if the camera aspect ratio is one and the skew is zero, the skew symmetry transformation
becomes a mirroring, and the curve $s$ will be bilaterally symmetric about $l_s$, as shown in
Figure 1.
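The involution property can be checked numerically. The sketch below builds $W$ from an arbitrary centre and axis (given as homogeneous 3-vectors) and multiplies it by itself; the function names and the sample values are assumptions for illustration.

```python
def harmonic_homology(v, l):
    """W = I - 2 v l^T / (v^T l) for a centre v and an axis l (homogeneous 3-vectors)."""
    vl = sum(a * b for a, b in zip(v, l))
    if vl == 0.0:
        raise ValueError("centre lies on the axis: not a homology")
    return [[(1.0 if r == c else 0.0) - 2.0 * v[r] * l[c] / vl for c in range(3)]
            for r in range(3)]


def matmul(A, B):
    """Product of two 3x3 matrices stored as nested lists."""
    return [[sum(A[r][k] * B[k][c] for k in range(3)) for c in range(3)]
            for r in range(3)]
```

With `v = (1, 0, 0)` and `l = (1, 0, 0)` (the centre at infinity along the x direction and the axis x = 0), `harmonic_homology` returns the mirroring diag(-1, 1, 1), which is the bilateral symmetry case mentioned above.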



2.2 Parameterisations of the Fundamental Matrix 

Consider a pair of camera matrices $P_1$ and $P_2$ related by a rotation with an angle $\theta \neq 0$
about an axis $a$ not passing through their optical centres, represented as the matrix $R_\theta$.
The image of the plane containing the optical centres of the cameras and orthogonal to the
axis $a$ is the horizon, and it is represented as the line $l_h$ in homogeneous coordinates. The
fundamental matrix $F$ relating $P_1$ and $P_2$ is given by (see [?])

$$F = [v_x]_\times + k \tan\frac{\theta}{2}\,(l_s l_h^T + l_h l_s^T), \qquad (2)$$

with $l_h^T v_x = 0$, using the notation of Section 2.1. The parameter $k$ is unknown but fixed for
any angle $\theta$, and cannot be obtained from two images alone. This should be expected, since
the terms in (2) are in homogeneous coordinates, and thus defined only up to arbitrary scale
factors.

Fig. 1. Lines joining points which are symmetric about the image of rotation axis $l_s$ (images are
scaled and translated independently for better observation). (a) The optical axis points directly
towards the rotation axis. (b) The camera is rotated about its optical centre by an angle $\rho$ of 20° in a
plane orthogonal to the rotation axis. (c) $\rho$ = 40°. (d) $\rho$ = 60°. (e) Same as (d), but the vanishing
point $v_x$ is also shown.



From (2) it is easy to prove that the epipole $e_i$, formed in the image of camera $P_i$, is
given by

$$e_i = v_x - (-1)^i\, k \tan\frac{\theta}{2}\,[l_s]_\times l_h. \qquad (3)$$

From (3) it can be seen that all the epipoles lie on the horizon $l_h$, independently of the value
of $\theta$. It can also be shown that the parameterisation given by (2) is equivalent to

$$F = [e_2]_\times W, \qquad (4)$$

where $W$ is given by (1), and, moreover, $e_1 = We_2$. The result in (4) shows that there is a
plane in space that induces the homology $W$. The proof of the following theorem does not
appear anywhere else, and it will be shown here in more detail.

Theorem 1. The planar homology $W$ relating the cameras $P_1$ and $P_2$ with $\theta \neq n\pi$,
$n \in \mathbb{Z}$, is induced by the plane $\Xi$ that contains the axis of rotation $a$ and bisects the
segment joining the optical centres of the cameras.

Proof. The existence and uniqueness of $\Xi$ satisfying the hypothesis of the Theorem are
trivial. Let $x_1 = [1\ 0\ 0]^T$, $x_2 = [0\ 1\ 0]^T$, and $x_3 = [0\ 0\ 1]^T$. Without loss of generality, let

$$P_1 = KR[I \mid x_3] \quad \text{and} \quad P_2 = KR[R_\theta \mid x_3], \qquad (5)$$

where $K$ is the intrinsic parameters matrix of $P_1$ and $P_2$, $R$ is the rotation matrix relating
the orientation of the coordinate system of $P_1$ to the world coordinate system, and $R_\theta$ is a
rotation by $\theta$ about the y-axis of the world coordinate system, i.e.,

$$R_\theta = \begin{pmatrix} \cos\theta & 0 & \sin\theta \\ 0 & 1 & 0 \\ -\sin\theta & 0 & \cos\theta \end{pmatrix}. \qquad (6)$$

Therefore, $\forall \alpha, \beta \in \mathbb{R}$, the point $X = [-\alpha\sin(\theta/2)\ \ \beta\ \ \alpha\cos(\theta/2)]^T$ lies on $\Xi$. Projecting
$X$ using $P_1$ and $P_2$, one obtains $u_1 = KR(X + x_3)$ and $u_2 = KR(R_\theta X + x_3)$. Since

$$R_\theta X = \begin{pmatrix} \alpha\sin\theta\cos(\theta/2) - \alpha\cos\theta\sin(\theta/2) \\ \beta \\ \alpha\sin\theta\sin(\theta/2) + \alpha\cos\theta\cos(\theta/2) \end{pmatrix} = \begin{pmatrix} \alpha\sin(\theta/2) \\ \beta \\ \alpha\cos(\theta/2) \end{pmatrix} = \begin{pmatrix} -1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix} X, \qquad (7)$$

or $R_\theta X = (I - 2x_1x_1^T)X$, we have $u_2 = KR[(I - 2x_1x_1^T)X + x_3]$, or, since
$(I - 2x_1x_1^T)x_3 = x_3$, $u_2 = (I - 2KRx_1x_1^TR^{-1}K^{-1})u_1$. It can be shown [15] that
$KRx_1 = v_x$ and $x_1^TR^{-1}K^{-1} = l_s^T$, and thus the result follows. □
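The key step of the proof can be verified numerically. The sketch below takes the simplest case $K = R = I$ (an assumption made to keep the example short), for which $v_x = l_s = x_1$ and $W$ is the mirroring diag(-1, 1, 1). It projects a point of the bisecting plane into both cameras and returns the two image points; the function name is hypothetical.

```python
import math


def project_point_on_bisecting_plane(theta, alpha, beta):
    """Return (u1, u2) for X = [-alpha sin(theta/2), beta, alpha cos(theta/2)], with K = R = I."""
    c, s = math.cos(theta), math.sin(theta)
    X = (-alpha * math.sin(theta / 2), beta, alpha * math.cos(theta / 2))
    # Rotation by theta about the y-axis, as in (6)
    RX = (c * X[0] + s * X[2], X[1], -s * X[0] + c * X[2])
    u1 = (X[0], X[1], X[2] + 1.0)     # P1 = [I | x3]
    u2 = (RX[0], RX[1], RX[2] + 1.0)  # P2 = [R_theta | x3]
    return u1, u2
```

For any `theta`, `alpha`, and `beta`, the returned `u2` equals `u1` with the sign of its first coordinate flipped, i.e. $u_2 = Wu_1$, as the theorem states.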



2.3 Epipolar Geometry and Apparent Contours 

Consider a surface $S$ viewed by two pinhole cameras $P_1$ and $P_2$. The following
definitions are presented as a quick review:

- a contour generator associated with the surface $S$ and the camera $P_1$ corresponds to
the space curve $C \subset S$ such that for all points $c \in C$ the line passing through the optical
centre of $P_1$ and $c$ is tangent to $S$ at $c$;

- the image of the contour generator associated with a camera $P_1$ on this same camera
is a profile or apparent contour;

- if two contour generators associated with the surface $S$ and the cameras $P_1$ and $P_2$
intersect, the points of intersection are denoted frontier points;

- the epipolar plane $\Pi$ defined by the optical centres of the two cameras $P_1$ and $P_2$ and
the frontier point is tangent to the associated surface $S$;

- the epipolar lines corresponding to the epipolar plane $\Pi$ are tangent to their associated
apparent contours and are called epipolar tangents.

The tangent point of associated epipolar tangencies corresponds to the image of the
same point on the surface $S$, namely the frontier point. All the above definitions can be
better understood by looking at Figure 2.



3 Motion Estimation 

Consider an object that undergoes a full rotation around a fixed axis. The envelope $e$ of its
profiles is found by overlapping the images of the sequence and applying a Canny edge
detector to the resultant image (Figure 3(b)). This envelope corresponds to the image of a
surface of revolution, and thus it is harmonically symmetric. The homography $W$ related
to $e$ is then found by sampling $N$ points $x_i$ along $e$ and optimising the cost function

$$f_W(v_x, l_s) = \sum_{i=1}^{N} \mathrm{dist}(e, W(v_x, l_s)x_i)^2, \qquad (8)$$

where $\mathrm{dist}(e, W(v_x, l_s)x_i)$ is the distance between the curve $e$ and the transformed sample
point $W(v_x, l_s)x_i$.

The initialisation of the line $l_s$ and the point $v_x$ can be made very close to the global
minimum by automatically locating one or more pairs of corresponding bitangents on the
envelope. The estimation of $W$ is summarised in Algorithm 1.

Fig. 2. The frontier point is a fixed point on the surface, corresponding to the intersection of two
contour generators. The epipolar lines corresponding to the frontier point are tangent to the profile.



Algorithm 1 Estimation of the harmonic homology W.
overlap the images in sequence;

extract the envelope e of the profiles using a Canny edge detector;
sample N points x_i along e;

initialise the axis of symmetry l_s and the vanishing point v_x using bitangents;
while not converged do

transform the points x_i using W;

compute the distances between e and the transformed points;
update l_s and v_x to minimise the function in (8);

end while
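The cost evaluated inside the loop of Algorithm 1 can be sketched as below. Several simplifications are assumptions of this example: the envelope is approximated by the sample points themselves, the distance to the curve is taken as the distance to the nearest sample, and the distances are squared before summation. The function name is hypothetical.

```python
def homology_cost(v, l, samples):
    """Sum of squared distances from each mapped sample W x_i to the nearest sample.

    samples are 2D points (x, y); v and l are homogeneous 3-vectors
    (centre and axis of the harmonic homology).
    """
    vl = sum(a * b for a, b in zip(v, l))
    cost = 0.0
    for (x, y) in samples:
        p = (x, y, 1.0)
        lp = sum(a * b for a, b in zip(l, p))
        q = [p[k] - 2.0 * v[k] * lp / vl for k in range(3)]  # W p
        qx, qy = q[0] / q[2], q[1] / q[2]
        cost += min((qx - sx) ** 2 + (qy - sy) ** 2 for (sx, sy) in samples)
    return cost
```

An optimiser would adjust `v` and `l` to drive this value down; for a point set that is exactly symmetric under the candidate homology the cost is zero.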



After obtaining a good estimate of $W$, one can then search for epipolar tangencies
between pairs of images in the sequence using the parameterisation given by (4). To obtain
a pair of corresponding epipolar tangents in two images, it is necessary to find a line tangent
to one profile which is transformed by $W^{-T}$ onto a line tangent to the profile in the other
image (see Figure 4). The search for corresponding tangents may be carried out as a
one-dimensional optimisation problem. The single parameter is the angle $\alpha$ that defines the
orientation of the epipolar line $l$ in the first image, and the cost function is given by

$$f_\alpha = \mathrm{dist}(W^{-T}l(\alpha), l''(\alpha)), \qquad (9)$$

where $\mathrm{dist}(W^{-T}l(\alpha), l''(\alpha))$ is the distance between the transformed line $l' = W^{-T}l$ and
a parallel line $l''$ tangent to the profile in the second image. Typical values of $\alpha$ lie between
-0.5 rad and 0.5 rad, or approximately -30° and 30°. The shape of the cost function (9) for
the profiles in Figure 4 can be seen in Figure 5.

Fig. 3. (a) Images 1, 8, 15 and 22 in the sequence of 36 images of a rotating vase. (b) Envelope of
apparent contours produced by overlapping all images in the sequence. (c) Estimation of the image
of the rotation axis.

Fig. 4. A pair of images of an object undergoing circular motion with a rotation of 80° is shown
in (a) and (b). The overlapping of the two images can be seen in (c). Corresponding epipolar lines
intersect at the image of the rotation axis, and all epipoles lie on a common horizon.

Fig. 5. Plot of the cost function (9) for a pair of images in the sequence. (a)/(b) Cost function for a
pair of corresponding epipolar tangents near the top/bottom of the profile in Figure 4.



Algorithm 2 Estimation of the orientation of the epipolar lines.
extract the profiles of two adjacent images using a Canny edge detector;
fit B-splines to the top and the bottom of the profiles;
initialise α;
while not converged do
    find l, l' and l'';
    compute the distance between l' and l'';
    update α to minimise the function in (9);
end while
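The one-dimensional search of Algorithm 2 can be sketched as follows. This is an illustration under several assumptions that are not taken from the paper: profiles are dense point samplings rather than B-splines, tangent lines are found with a support function (largest projection onto the line normal), the search uses golden-section bracketing, and the homology is a plain reflection about the y-axis. All function names and the two synthetic circular profiles are hypothetical.

```python
import math

def tangent_offset(points, normal):
    """Largest n.p over the profile: the line n.x = offset touches the profile."""
    return max(normal[0] * x + normal[1] * y for x, y in points)

def cost(alpha, W, profile1, profile2):
    """Distance between l' = W^{-T} l(alpha) and the parallel tangent l''."""
    # Line l(alpha): orientation alpha, tangent to the first profile.
    n = (-math.sin(alpha), math.cos(alpha))
    l = (n[0], n[1], -tangent_offset(profile1, n))
    # Lines map as l' = W^{-T} l; a harmonic homology is an involution, so W^{-T} = W^T.
    lp = [sum(W[r][c] * l[r] for r in range(3)) for c in range(3)]
    norm = math.hypot(lp[0], lp[1])
    lp = [a / norm for a in lp]
    # l'' has the same unit normal as l' and is tangent to the second profile.
    return abs(-lp[2] - tangent_offset(profile2, (lp[0], lp[1])))

def golden_section(f, lo, hi, tol=1e-6):
    """Minimise a unimodal function f on [lo, hi] by golden-section search."""
    g = (math.sqrt(5) - 1) / 2
    a, b = lo, hi
    c, d = b - g * (b - a), a + g * (b - a)
    while b - a > tol:
        if f(c) < f(d):
            b, d = d, c
            c = b - g * (b - a)
        else:
            a, c = c, d
            d = a + g * (b - a)
    return (a + b) / 2

def circle(cx, cy, r, n=720):
    """Dense sampling of a circle, standing in for an extracted profile."""
    return [(cx + r * math.cos(2 * math.pi * k / n), cy + r * math.sin(2 * math.pi * k / n))
            for k in range(n)]

W = [[-1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]   # reflection about the y-axis
profile1 = circle(1.2, 1.4, 1.0)
profile2 = circle(0.8, 1.0, 1.0)
alpha = golden_section(lambda a: cost(a, W, profile1, profile2), -0.5, 0.5)
```

For these two circles the cost vanishes where tan α = 0.4 / 2, so the search should return a value close to atan(0.2); the bracket [−0.5, 0.5] follows the typical range of α quoted in the text.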



The epipoles can then be computed as the intersection of epipolar lines in the same
image. After obtaining this first estimate for the epipoles, the image of the horizon can then
be found by robustly fitting a line l_h to the initial set of epipoles, such that l_h^T v_x = 0.
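As a small illustration of the first step, two epipolar lines in the same image can be intersected with a cross product in homogeneous coordinates. The helper names and the sample lines below are hypothetical; in practice the lines would be the tangents found by Algorithm 2 near the top and the bottom of a profile.

```python
def cross(a, b):
    """Cross product of two 3-vectors."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def line_through(p, q):
    """Homogeneous line through two image points (x, y)."""
    return cross((p[0], p[1], 1.0), (q[0], q[1], 1.0))

def intersect(l1, l2):
    """Intersection of two homogeneous lines, or None if they are parallel."""
    x, y, w = cross(l1, l2)
    if abs(w) < 1e-12:
        return None
    return (x / w, y / w)

# Two epipolar tangents in one image (synthetic values), meeting at the epipole.
l_top = line_through((0.0, 0.0), (4.0, 2.0))
l_bottom = line_through((0.0, 4.0), (4.0, 2.0))
epipole = intersect(l_top, l_bottom)
```

Here the two lines meet at (4, 2). Nearly parallel epipolar lines give an epipole far from the image centre, which is one reason a robust fit of the horizon is useful afterwards.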

An alternative method to compute the epipoles is to register the profiles using the homology W,
eliminating the effects of rotation on the images, and then apply any of the methods in im™ ,
in a plane + parallax approach. However, no advantage has been obtained by doing so, since to use
this method it is necessary to search for a common tangent between two profiles, which involves
a search at least as complex as the one in Algorithm 2.

Figure 7 shows a typical output of Algorithm 2 together with the horizon l_h fitted to
the epipoles. After estimating the horizon, the only missing term in the parameterisation
of the fundamental matrix is the scale factor k tan θ/2. This parameter can be found by, again,
a one-dimensional search that minimises the geometric error of transformed epipolar lines, as
shown in Figure 6.

Fig. 6. Geometric error for transformed epipolar lines, with the scale factor k tan θ/2 set to
100 for better visualisation. The terms v_x, l_s and l_h were obtained from Algorithm 1 and
Algorithm 2. The solid lines in each image correspond to tangents to the profile passing through
the epipoles, and the dashed lines correspond to lines transferred from one image to the other by
applying the harmonic homology W. The distance between transformed lines and the corresponding
tangent points is the cost function that drives the search for the scale factor k tan θ/2.



4 Implementation and Experimental Results 

The algorithms described in the previous section were tested using a set of 36 images of a
vase placed on a turntable (see Figure 3(a)) rotated by an angle of 10° between successive
snapshots. To obtain W, Algorithm 1 was implemented with 100 evenly spaced sample
points along the envelope (N = 100). Bitangents were used to find an initial guess for the
homology W. Fewer than 10 iterations of the Levenberg-Marquardt algorithm are necessary,
with derivatives computed by finite differences. The final configuration of the rotation axis
can be seen in Figure 3(c).
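The text states only that derivatives were computed by finite differences. A minimal sketch of a forward-difference Jacobian, of the kind a Levenberg-Marquardt iteration could consume, is given below. The forward scheme, the step size, and the function name are assumptions, and the residual function is a toy stand-in for the point-to-envelope distances.

```python
def numerical_jacobian(residuals, params, step=1e-6):
    """Forward-difference Jacobian: J[i][j] = d residual_i / d param_j."""
    r0 = residuals(params)
    J = [[0.0] * len(params) for _ in r0]
    for j in range(len(params)):
        shifted = list(params)
        shifted[j] += step                 # perturb one parameter at a time
        rj = residuals(shifted)
        for i in range(len(r0)):
            J[i][j] = (rj[i] - r0[i]) / step
    return J

def toy_residuals(p):
    """Toy residual vector with a known analytic Jacobian [[2 p0, 0], [p1, p0]]."""
    return [p[0] ** 2, p[0] * p[1]]

J = numerical_jacobian(toy_residuals, [1.0, 2.0])
```

Each column costs one extra evaluation of the residuals, so for the few parameters that describe l_s and v_x the overhead per iteration stays small.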

In the implementation of Algorithm 2, 70 pairs of images were selected by uniformly
sampling the indexes of the images, and the resultant estimate for the epipoles is shown in
Figure 7, which also shows the horizon l_h found by a robust fit. To get l_h, a minimisation
of the median of the squares of the residuals was used, followed by removal of outliers
and orthogonal least-squares regression using the remaining points (inliers). The epipolar
geometry was then re-estimated with the epipoles constrained to lie on l_h. The resulting
camera configurations are presented in Figure 8.
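The robust fit described above can be sketched in three stages: a least-median-of-squares line, outlier removal, and orthogonal regression on the inliers. The sketch makes assumptions that are not stated in the paper: candidate lines are generated exhaustively from all point pairs, and outliers are rejected with a simplified rule (residual larger than 2.5 × 1.4826 × the square root of the median squared residual). Function names and the synthetic points are hypothetical.

```python
import math
from itertools import combinations
from statistics import median

def line_through(p, q):
    """Unit-normal line (a, b, c), with a*x + b*y + c = 0, through two distinct points."""
    a, b = q[1] - p[1], p[0] - q[0]
    n = math.hypot(a, b)
    return (a / n, b / n, -(a * p[0] + b * p[1]) / n)

def sq_residuals(line, pts):
    """Squared orthogonal distances of the points to a unit-normal line."""
    a, b, c = line
    return [(a * x + b * y + c) ** 2 for x, y in pts]

def lmeds_line(pts):
    """Line through a point pair with the smallest median squared residual."""
    best = min((line_through(p, q) for p, q in combinations(pts, 2)),
               key=lambda ln: median(sq_residuals(ln, pts)))
    return best, median(sq_residuals(best, pts))

def orthogonal_fit(pts):
    """Orthogonal (total) least-squares line through the centroid of the points."""
    n = len(pts)
    mx = sum(x for x, _ in pts) / n
    my = sum(y for _, y in pts) / n
    sxx = sum((x - mx) ** 2 for x, _ in pts)
    syy = sum((y - my) ** 2 for _, y in pts)
    sxy = sum((x - mx) * (y - my) for x, y in pts)
    theta = 0.5 * math.atan2(2 * sxy, sxx - syy)   # direction of largest spread
    a, b = -math.sin(theta), math.cos(theta)       # unit normal to that direction
    return (a, b, -(a * mx + b * my))

def robust_horizon(pts, k=2.5):
    """LMedS fit, outlier removal (assumed threshold rule), then orthogonal regression."""
    line, med = lmeds_line(pts)
    sigma = 1.4826 * math.sqrt(med)                # robust scale estimate
    inliers = [p for p, r in zip(pts, sq_residuals(line, pts)) if r <= (k * sigma) ** 2]
    return orthogonal_fit(inliers), inliers

# Synthetic epipoles scattered around the line y = 2, plus two gross outliers.
points = [(float(k), 2.0 + 0.01 * (-1) ** k) for k in range(10)]
points += [(3.0, 9.0), (7.0, -5.0)]
horizon, inliers = robust_horizon(points)
```

The exhaustive pair search is quadratic in the number of epipoles, which is acceptable for a few dozen points; random sampling of pairs is the usual alternative for larger sets.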




Fig. 7. Epipoles estimated by Algorithm 2. The horizon is found by doing a robust fit to the cloud
of epipoles.

Fig. 8. Lateral and top view of the estimated configuration of the cameras. The technique to
reconstruct the object shown at the bottom in (a) and in the centre in (b) is described in Section 5.



The object was rotated on a manual turntable with resolution of 0.01°, but the real 
precision achieved is highly dependent on the skills of the operator. The RMS error in the 
estimated angles is less than 0.2°, as can be seen from Figure 9, demonstrating the accuracy
of the estimation.

It is interesting to compare this result with the ones shown in [10, pg. 166] for the “Head”,
“Freiburg” and “Dinosaur” sequences, where the average number of point matches per 
image pair varies from 137 to 399, depending on the sequence. It should be stressed that 
only two epipolar tangents were used for each pair of images in the experiments presented 
in this paper, with comparable results. 





Fig. 9. Estimated angle of rotation between successive views. The RMS error is 0.2°, for a maximum
resolution of 0.01° for the manual turntable.

5 Reconstruction from Image Profiles

The algorithm for motion estimation introduced here can perfectly be used even when point 
correspondences can be established. On the other hand, methods as the ones in |H] and M 
cannot deal with situations where profiles are the only available features in the scene, and 
it is therefore natural to apply the motion recovered by the technique shown in this paper to
the problem of reconstruction from apparent contours. To solve this problem under known
motion, the main algorithms can be found in imum . Results reported in compare 
the last three, and although it slightly favours the one in jij], the simplicity of the method 
proposed in justifies its choice for evaluating the accuracy of the motion estimated 
here. 

5.1 Description of the Method 




Fig. 10. The correspondence between the points u_1 and u_2 is established via the epipolar
parameterisation. The result of the triangulation of u_1 and u_2 is not a point on the surface,
but if the motion is small, the error will be negligible.

The algorithm for reconstruction from apparent contours introduced in l2l is based on 
the assumption that, if the motion is small, the error in triangulating correspondences on 
images of successive contour generators, established via the epipolar parameterisation, will 
be negligible (see Figure 10). This corresponds to a finite-difference approximation of the
technique shown in @ . A summary of the procedure is shown in Algorithm 3.



Algorithm 3 Reconstruction from image profiles.
for i = 1 to N − 1 do
    sample M points u_j along the profile in image i;
    for j = 1 to M do
        compute the epipolar line l in image i + 1 corresponding to the point u_j;
        find the intersection u'_j of the line l with the profile in image i + 1;
        triangulate the points u_j and u'_j;
    end for
end for

5.2 Implementation and Experimental Results

A B-spline was fitted to the left side of the profile in the sequence of images shown in
Figure 3(a). From the top to the bottom, 18 points were sampled on the spline in the first image
(see Figure 11(a)), from which the corresponding epipolar lines in the second image were
computed, and associated points were then triangulated. The intersection of the epipolar
lines with the profile in the second image is shown in Figure 11(b). Since the points satisfy
the epipolar constraint by construction, the triangulation will be exact, i.e., the rays
associated with the points in the first image will exactly intersect the corresponding rays in
the second image. As pointed out in O, in this case the choice of triangulation method becomes
irrelevant, and a simple least-squares solution was adopted.
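One simple least-squares formulation is the midpoint method, which returns the point closest to both back-projected rays. The sketch below is an illustration only and is not necessarily the solution adopted in the paper. It assumes each ray is given by a camera centre and a direction in a common world frame; the function names and numeric values are hypothetical.

```python
def dot(a, b):
    """Dot product of two vectors of equal length."""
    return sum(x * y for x, y in zip(a, b))

def triangulate_midpoint(c1, d1, c2, d2):
    """Point minimising the summed squared distance to the rays c1 + s*d1 and c2 + t*d2."""
    w = [p - q for p, q in zip(c1, c2)]
    a, b, c = dot(d1, d1), dot(d1, d2), dot(d2, d2)
    d, e = dot(d1, w), dot(d2, w)
    den = a * c - b * b                    # zero only when the rays are parallel
    s = (b * e - c * d) / den
    t = (a * e - b * d) / den
    p1 = [ci + s * di for ci, di in zip(c1, d1)]   # closest point on the first ray
    p2 = [ci + t * di for ci, di in zip(c2, d2)]   # closest point on the second ray
    return [(x + y) / 2 for x, y in zip(p1, p2)]

# Two rays constructed to meet exactly at (1, 2, 5), as when the epipolar constraint holds.
centre1, centre2 = (0.0, 0.0, 0.0), (4.0, 0.0, 0.0)
target = (1.0, 2.0, 5.0)
ray1 = tuple(t - c for t, c in zip(target, centre1))
ray2 = tuple(t - c for t, c in zip(target, centre2))
point = triangulate_midpoint(centre1, ray1, centre2, ray2)
```

When the two rays intersect exactly, the two closest points coincide and the midpoint is the intersection itself, which is consistent with the remark that the choice of triangulation method becomes irrelevant in this case.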





Fig. 11. (a) Points sampled in the first image. (b) Corresponding epipolar lines in the second
image. The triangulation is carried out between a point in the first image and the intersection of
its corresponding epipolar line and the profile in the second image.

Figure 8 shows the relative position of the reconstructed object. Incidentally, the camera
is far away, making both the motion estimation and the reconstruction an even more
challenging problem, since the most appropriate model to deal with such situations is the
affine model, instead of the projective model used throughout this paper. Details of the 3D
reconstruction of the object are shown in Figure 12 and Figure 13.

6 Summary and Conclusions 

This paper introduces a novel technique for motion estimation from image profiles. It does 
not make use of expensive search procedures, such as bundle adjustment, although it natu- 
rally integrates data from multiple images. The method is mathematically sound, practical 
and highly accurate. From the motion estimation to the model reconstruction, no point 
tracking is required and it does not depend on having point correspondences beforehand. 

The convergence to local minima, a critical issue in most non-linear optimisation prob- 
lems, is avoided by a divide-and-conquer approach which keeps the size of the problem 
manageable. Moreover, a search space with lower dimension results in fewer iterations before
convergence. The quality of the reconstructed model is remarkable, in particular if one
considers that only the least possible amount of information has been used.

Fig. 12. Details of the reconstruction of the object in Figure 3(a). The reconstructed model is
smooth, even considering that the epipolar parameterisation is degenerate in the neighbourhood of
the frontier points. The views in (a) correspond to an angle φ of 10° with respect to the y-axis.
(b) φ = 0°. (c) φ = 170°. The original viewing direction, computed from the estimated motion,
is φ = 24.35°.




Fig. 13. Reconstruction of the object in Figure 3(a), showing the shaded surface. The view points
are the same as in Figure 12.